THESIS ABSTRACT

Analysis of standardized reading comprehension tests 

Reading comprehension skill builders and tests seem to reflect 
different concepts of comprehension. A review of the history and 
development of reading comprehension tests, and the research in 
comprehension revealed that no concise or agreed upon definition of 
reading comprehension exists.

Despite the lack of clear definition, tests of reading compre- 
hension are frequently used in evaluating the performance of pupils 
and teachers as well as the effectiveness of instructional materials 
and methods.

This study was designed to investigate what reading comprehension 
tests evaluate, i.e. what pupils must do or know to perform well on 
selected standardized reading comprehension tests. Standardized 
reading comprehension tests were selected at three levels (Grade 1-2, 
4-6, 9-14) and from three batteries (California Achievement Test,
Gates-MacGinitie Reading Test and Stanford Achievement Test).

Two types of analyses were conducted. The first analysis was a 
study of the readability of reading comprehension test items. Two 
widely-used readability formulae were employed — Dale-Chall (1948) 
and Spache (1953). The second analysis was a study of tasks required 
by reading comprehension test items. The measures of task were designed
for this study and included a rating scale for the reading selections,
a rating scale for the questions and a rating scale for the choices. 
Distinct and differing characteristics emerged for both readability 
scores and task ratings among the three test levels and the three 
test batteries analyzed. 

Tests were found to differ on essentially all readability counts,
e.g., one test had an average selection length of over 400 words while
another at the same test level but from a different test battery had
an average of less than 60 words. However, complementary relationships
seemed to exist, e.g. while one test had long reading selections and
short questions, another test had short selections and long questions.
Also, readability scores consistently increased with higher test 
level. For example, reading selections, questions and choices were 
usually longer and had more hard words at higher test levels. 

The task analysis revealed that different test batteries contained
somewhat similar types of reading selections, differed considerably
on types of questions and had somewhat similar distractors. At lower
test levels selections were generally about common incidents and
people. At higher test levels, selections were more about academic
subjects such as science or social studies. Test questions were of
two major types: paraphrase and concept. Paraphrase questions
included eight kinds of restatements of given information, e.g.
contextual paraphrase, grammatical paraphrase. Concept questions
included six categories and always applied when all the information
was not given in the reading selection, e.g. probable concept,
language concept, previous knowledge of science.

Whereas selected reading comprehension tests were found to differ 
in what they were testing, they appeared to be testing abilities 
similar to those evaluated by I.Q. tests and achievement tests in 
other school subjects. 

The findings of the analyses suggested a model for new and better 
defined reading comprehension tests. Such tests would include "cri- 
teria" and descriptions of the following five features: 

1. length, sentence length, and hard word ratio of 
reading selections, questions and choices 

2. topics of reading selections 

3. tasks necessary for supplying the correct answer 
to the question 

4. types of distractors provided as alternate answers. 



CHAPTER I



The Problem 



Any child who fails to acquire the 
ability to read has been denied a 
right — a right as fundamental as 
the right to life, liberty, and the 
pursuit of happiness — the right to 
read. . . . 

It is inexcusable that in this day 
when man has achieved such giant 
steps in the development of his 
potential, when many of his accom- 
plishments approach the miraculous, 
there still should be those who do 
not learn to read.... 

Therefore, as U.S. Commissioner of 
Education, I am herewith proclaiming 
my belief that we should immediately 
set for ourselves the goal of 
assuring that by the end of the 
1970's the right to read shall be a 
reality for all — that no one shall 
be leaving our schools without the 
skill and the desire necessary to 
read to the full limits of his 
capability. 

James E. Allen, Jr.



Introduction



The ultimate goal of reading instruction is to develop 
reading comprehension (Chall, 1967, p. 307). Unfortunately, 
there is as yet no concise or agreed upon definition of what 
comprises reading comprehension. Consequently, no consistent 
means exist of either teaching or testing it. 

This thesis illustrates the problem by demonstrating the 
existence of major differences among a sample of reading 
comprehension skill builders and among selected reading 
comprehension tests. 

An attempt to find a more consistent definition of reading 
comprehension has been undertaken here by systematically 
analyzing standardized reading comprehension tests. These 
empirically constructed tests have long been the accepted 
criterion for establishing success or failure in reading 
comprehension. A clarification of what tests are actually 
testing should contribute to a clarification of what is 
currently meant by reading comprehension.

Current Trends in Teaching Reading Comprehension

An analysis of current trends in teaching reading comprehension 
will demonstrate some of the confusion that exists in defining the 
concept. 

Typical materials used to teach comprehension are collections of
reading selections (either in basal readers with accompanying
workbooks, or booklets, or boxed packages called "reading laboratories").¹

¹A sample of widely used basal readers with workbooks, booklets
and laboratories that teach comprehension:

Title                                             Publisher and Date                  Reading Grade Level

Basal Readers with Workbooks
  Basic Readers                                   Ginn, 1969                          Pre-primer - 6
  Basic Reading                                   Lippincott, 1964                    Pre-primer - 8
  Basic Reading Program                           Harper & Row, 1966                  Pre-primer - 6
  Macmillan Reading Program                       Macmillan, 1966                     Pre-primer - 6
  New Basic Readers                               Allyn & Bacon, 1968                 Pre-primer - 6
  Open Court Basic Readers                        Open Court, 1967                    Pre-primer - 6
  Sheldon Basic Reading Series                    Allyn & Bacon, 1968                 Pre-primer - 8

Booklets
  Be a Better Reader                              Prentice-Hall, 1968                 4-12
  Macmillan Reading Spectrum                      Macmillan, 1964                     4-6
  McCall-Crabbs Standard Test Lessons in Reading  Teachers College, 1961              2-12
  New Practice Readers                            McGraw Hill, 1961                   2-8
  Read for Meaning                                Lippincott, 1955                    4-12
  Readers Digest Skill Builders                   Readers Digest, 1963                3-8
  Reading Exercises                               Teachers College, 1965              2-6
  Reading for Concepts                            McGraw-Hill, 1970                   1-6
  Specific Skill Series                           Barnell Loft, 1967                  1-6

Laboratories
  Reading Attainments System                      Grolier, 1967                       3, 4 (easy reading intended for older pupils and adults)
  Reading Laboratory                              Science Research Associates, 1961   2-7
  Reading for Understanding                       Science Research Associates, 1958   5 - college

The selections are generally graded in difficulty by field-testing
them on children of various grades, by asking expert opinion, or by 
one of the widely used readability formulae such as the Dale-Chall 
(1948), Flesch (1948), or Spache (1953). Comprehension questions
follow the selections. These are usually multiple-choice or
completion questions that ask the student to identify or relate the
"main idea," "facts," or "inferences."

Table 1 summarizes some aspects of five randomly selected skill 
builders in booklet form. Generally the information in Table 1 was 
taken directly from teachers' manuals, although in some cases the 
"topics" and "questions" were not explicitly stated in the manual. 
"Topics" and "questions" were then established by reviewing the skill 
builders themselves. 
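
The readability formulae mentioned above as one means of grading selections combine a count of "hard" words with sentence length. The following is a minimal sketch of a Dale-Chall style raw score, offered only as an illustration. The constants are those commonly cited for the 1948 formula; the small `FAMILIAR_WORDS` set, the function name and the simple sentence splitter are assumptions that stand in for the full Dale word list (roughly 3,000 words) and the hand-counting rules of the published procedure.

```python
import re

# Hypothetical stand-in for the Dale list of familiar words
# (the published list has roughly 3,000 entries).
FAMILIAR_WORDS = {
    "the", "a", "dog", "ran", "to", "boy", "and", "he",
    "was", "happy", "his", "with", "played",
}


def dale_chall_raw_score(text, familiar=FAMILIAR_WORDS):
    """Return (percent_unfamiliar, avg_sentence_length, raw_score)."""
    # Naive split on ., ! or ? (an assumption; the formula was applied by hand).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    unfamiliar = sum(1 for w in words if w not in familiar)
    pct_unfamiliar = 100.0 * unfamiliar / len(words)
    avg_sentence_len = len(words) / len(sentences)

    # Constants as commonly cited for the Dale-Chall (1948) formula.
    raw = 0.1579 * pct_unfamiliar + 0.0496 * avg_sentence_len + 3.6365
    return pct_unfamiliar, avg_sentence_len, raw


easy = "The boy played with his dog. He was happy."
hard = "The boy contemplated metaphysical propositions. He was perplexed."
print(dale_chall_raw_score(easy))
print(dale_chall_raw_score(hard))
```

With the toy word list, the second passage receives the higher raw score because half of its words fall outside the list, even though its sentences are slightly shorter.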

The differences in the structure and content of these skill 
builders appear to reflect the differences in the authors' concep- 
tions of reading comprehension. For example, in "purpose" (see 
Table 1) New Practice Readers proposes to develop seven "elements in 
comprehension," while Reading Exercises proposes to develop speed, 
general comprehension and three "specialized skills." In "questions," 
Standard Test Lessons in Reading has only multiple-choice questions, 
while Be A Better Reader has primarily open-ended questions (questions 
for which no answer choices are provided).

Four major related issues emerge from an analysis of the pur- 
poses of reading comprehension skill builders. First, skill builders 
use different instructional strategies for teaching comprehension.

Comparison of Five Comprehension Skill Builder Booklets

*Words and phrases such as "recalling, recognizing, understanding,
finding, ability to, competence in," etc. were deleted from the manuals'
description of skills (above) because they seemed to be used
interchangeably and thus obfuscated rather than clarified the skill
descriptions.



Some skill builders concentrate on exercising testing procedure.
For example, Standard Test Lessons in Reading emphasizes increasing
reading speed and the number of correct answers by providing practice
in test-like exercises. Other skill builders seem to emphasize
increasing the readers' background knowledge. For example, Reading
for Concepts provides the reader with organized information about
selected disciplines. The Standard Test Lessons in Reading approach
suggests that comprehension is improved most by continuous practice
in answering certain types of questions (similar to those on tests),
while the latter approach suggests that comprehension is improved
most by giving the student specific types of information or subject
matter knowledge.

Second, some skill builders concentrate on general topics, e.g. 
Standard Test Lessons in Reading includes selections about animals, 
city life, plants, people, etc. in no apparent sequence or proportion, 
thus implying that comprehension "skills" are general to many types 
of reading matter. Other skill builders carefully differentiate among 
topics, e.g. Be a Better Reader presents exercises in 4 disciplines: 
social studies, science, new math, literature, and in so doing implies 
that comprehension "skills" are different for different subjects or 
disciplines. 

Third, some skill builders isolate specific reading comprehension
"skills," such as Reading Exercises, which presents special booklets
and exercises for identifying "details," for finding the "main idea," 
and for "following directions." This suggests that comprehension is
composed of numerous separate subskills. Other skill builders combine
many skills, e.g. Standard Test Lessons in Reading mixes questions 
about "stated facts," "inferences" and "main thoughts" into one
exercise and booklet in no apparent sequence or proportion. This approach
suggests that comprehension is more of a general skill than a 
combination of clearly defined subskills. 

Fourth, some materials have recently become available, usually 
for fourth grade and higher levels, that attempt to give instruction 
to the student in how to go about answering certain types of questions. 
For example, Be a Better Reader (1968, p. 4) instructs pupils to know
"who the people in the story are... where they are, what they did,
and what happened to them" from words directly stated in the reading
selection. Such instruction is intended to help pupils understand
facts.¹

Most materials do not provide adequate measures to determine and
treat types of reading difficulties.² If a pupil consistently answers
questions incorrectly there is usually no suggestion given to either
pupil or teacher other than continued exercise of the same kind.
Indeed, the basic assumption underlying most instructional materials
and methods is that comprehension can be induced by raising, or
lowering in the case of problem learners, the readability level of
the reading matter presented.

¹Some inadequacies of this technique of teaching reading
comprehension are presented in Simons (1970), Chapter 1.

²Reading difficulties as demonstrated by errors on these and
similar reading exercises may be due to a misunderstanding or
misinterpretation of the selection or question (Thorndike, 1914). Errors
may also be due to a deficiency in word recognition (Thorndike, 1915;
Chall, 1958a). Since exercises are read silently and questions are
answered independently the source of error usually remains undetermined.

The fundamental issue here is whether reading comprehension is 
an analyzable skill that can be reliably diagnosed and directly taught; 
or whether it is an unanalyzable skill that can merely be exercised 
in a general way. 

The lack of agreement within the category of purpose of skill 
building exercises, as described above, also exists within the other 
categories listed in Table 1. The "skill" category of Table 1 contains
only vague descriptions given in the skill builders' teachers'
manuals. Language used to describe the skills in Table 1 for different
skill builders may be identical, e.g., both Reading Exercises and 
Reading for Concepts list as one of their skills "main idea." Unfor- 
tunately, it is not clear that the corresponding tasks are, indeed, 
identical. In addition, the lists of skills seem to confuse
instructional procedures and psychological processes with comprehension
skills.¹ For instance, while "interpreting" may be a comprehension
skill that can be developed through instruction and exercise,
"inference" may be more of a psychological process that cannot be
readily modified. And "finding the main idea," "locating details,"
and "following directions" generally seem to be instructional "sets"
or procedures used by teachers or authors of skill builders to
exercise comprehension. While exercise materials generally represent
instructional strategies and procedures, their relationship to
psychological processes and comprehension skills remains enigmatic.

¹Simons, op. cit.

It appears that materials designed to teach reading comprehension 
are inconsistent in their underlying hypotheses about the nature of 
comprehension and in the types of "lessons" designed to exercise it. 

In addition, these materials generally do not afford the teacher or 
pupil an analysis of strengths and weaknesses, nor do they usually pro- 
vide instructional procedures other than the selection-question 
exercise. Consequently, most materials designed to teach reading 
comprehension actually are tests of comprehension arranged by 
successive levels of difficulty. 


Current Trends in Testing Reading Comprehension 

The issues and problems of instructional materials in comprehen- 
sion also apply to tests of reading comprehension. Unfortunately, 
the problems involved in testing comprehension are even more serious 
since tests represent the criteria of competence in comprehension. 

In other words, comprehension is generally defined by what is tested 
on tests of reading comprehension. 

The results of these tests have considerable educational and 
social significance. Reading comprehension test scores are used for 
accepting pupils into schools, putting pupils into special classes, 
grouping pupils within classes, determining promotion, acceleration or 
demotion, presenting academic awards, counseling for future education 
and in some cases setting teacher expectation. Furthermore, new
educational materials as well as many government and industry-sponsored
educational programs as well as teachers and methods are evaluated
mainly on the results of these tests. Annually many millions of dollars
from the educational budget are appropriated solely for the purchase
and scoring of standardized tests.¹

Standardized tests are almost universally used in schools to assess 
pupils' reading ability (Stevens, 1971). These tests are generally
constructed on the basis of three assumptions. The assumptions are that
older children and children with higher I.Q.s perform better; and that 
performance in comprehension follows the "normal distribution" model. 

The "normal distribution" model generally assumes that performance in 
reading comprehension is superior in 4% of the population of students 
at each grade or test level, above average in another 19% of the 
population, average in 54% of the population, below average in 19% of
the population and poor in another 4% of the population (Kelley, et al.,
1966, p. 10).

Thus, a large number of experimental test items are administered
to many pupils at many grade levels. Either all or a sample of these
pupils are also given I.Q. tests.² Test items that are found to
discriminate empirically, for whatever reason, among older and younger
pupils and among pupils with high and low I.Q.s, as well as items that
have acceptable empirical difficulty scores are considered for
inclusion in the final form of the tests. The empirical difficulty score
of an item is the proportion of pupils in a given population that
answered the item correctly. Items are included in tests so that the
combination of difficulty scores forms a "normal distribution." For
example, a relatively small proportion of items passed empirically, for
whatever reason, by a small proportion of pupils is chosen. These
items are considered difficult and probably make up about 23% of the
items in the test. A relatively large proportion of items, 54%, is
considered of average difficulty and includes items that were passed
by approximately 70-80% of the population, and so on, until a "normal
distribution" appears.

¹The cost of the first state-wide standardized achievement testing
program in Massachusetts, which included about 100,000 fourth graders,
was $120,000.00 (Cohen, 1971, p. 1, 5).

²For example, pupils tested for the California Achievement Tests
were also given the California Test of Mental Maturity (California Test
Bureau, 1957, p. 18-20); some pupils given the Gates-MacGinitie Reading
Tests were also given the Lorge-Thorndike Intelligence Tests (Gates
and MacGinitie, 1969, p. 1-2); and pupils given the Stanford Achievement
Test were also given the Otis Quick-Scoring Mental Ability Test (Kelley,
et al., 1966, p. 9).
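
The empirical difficulty score defined above is a per-item proportion correct. The following is a minimal sketch of that calculation, assuming responses are recorded as a pupils-by-items table of 1 (correct) and 0 (incorrect). The function name and the sample data are hypothetical and are not taken from any test manual.

```python
def item_difficulty(responses):
    """Proportion of pupils answering each item correctly.

    responses: list of per-pupil lists, each holding 1 (correct)
    or 0 (incorrect) for every item, in the same item order.
    """
    if not responses:
        raise ValueError("need at least one pupil")
    n_pupils = len(responses)
    n_items = len(responses[0])
    return [
        sum(pupil[i] for pupil in responses) / n_pupils
        for i in range(n_items)
    ]


# Hypothetical data: 5 pupils, 3 items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
]
scores = item_difficulty(responses)
print(scores)  # [0.8, 0.8, 0.2]
```

Note that a higher score means an easier item: the third item, passed by one pupil in five, is the "difficult" one in this toy sample.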

After the test items are chosen standardization procedure requires 
selecting large student samples representative of the national school 
population (Kelley, et al, 1966, p. 9). Representativeness is usually 
based on census data and includes such population characteristics as 
geographic distribution, community or school size, median family income, 
median number of years of schooling completed by those over 25 years 
of age, chronological age by grade, and mental ability of the group 
(California Test Bureau, 1957, p. 12; Gates and MacGinitie, 1965d, p. 2; 
Kelley, et al., 1966, p. 9-10). Tests are uniformly administered to
pupils in this population. Norms for test scores at given grade 
levels are calculated. Common scores are grade scores, percentiles
and stanines (Kelley, et al., 1966, p. 10).¹

Consequently, standardized test scores provide only a rank of
pupil performance on a given task in relation to the standardization
population. No matter how literate a population is, the lower 23%
always scores below "grade level."² Furthermore, no one seems to
know why one item is harder than another, what a pupil must do or know
to answer the question and so on. Generally standardization procedures
seem to receive a disproportionate amount of description and discussion
in test manuals at the expense of more qualitative aspects of
the tests.
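
The point that such scores report only rank can be illustrated with a short sketch. It assumes a hypothetical `percentile_rank` helper (the percent of the norming sample scoring strictly below a given raw score), made-up raw scores, and norms that are recomputed on the same population after every score rises.

```python
def percentile_rank(score, norm_sample):
    """Percent of the norming sample scoring strictly below `score`."""
    below = sum(1 for s in norm_sample if s < score)
    return 100.0 * below / len(norm_sample)


# Hypothetical raw scores for a norming sample of ten pupils.
sample = [8, 10, 12, 13, 15, 16, 18, 20, 22, 25]
# The same pupils after every raw score rises by 10 points.
improved = [s + 10 for s in sample]

ranks_before = [percentile_rank(s, sample) for s in sample]
ranks_after = [percentile_rank(s, improved) for s in improved]
print(ranks_before == ranks_after)  # True
```

Every pupil reads better in the second sample, yet each keeps the same rank, so the lowest-ranked pupils remain below the norm however much the group as a whole improves.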

Table 2 summarizes several "qualitative" aspects of three
standardized reading comprehension tests. Information for Table 2 was
taken from teachers' manuals and technical bulletins that accompanied
the tests. Differences among tests are apparent.³ Again, the differences
in the structure and content of comprehension tests reflect
the differences in their authors' conceptions of reading comprehension.

As noted on Table 2, purposes and skills for tests of comprehension
seem even more vague than those of instructional materials (on Table 1),
e.g., skill builders enumerated the skills they included in comprehension:
Reading Exercises named "details," "main idea," and "following
directions." Test authors seemed less concerned with specificity. One
test author merely stated that the test was evaluating "extremely
simple recognition to the making of inferences" (Kelley, et al., 1964a,
p. 5). Hence only very limited information about the authors'
conception of reading comprehension could be gained from test
descriptions or manuals. Chall (1967, p. 312) has noted that "standardized
reading tests often mask some of the important outcomes of instruction
because they measure a conglomerate of skills and abilities at the
same time."

¹Dr. Henry Dyer (1971, p. 19) has attacked these scores calling them
"psychological and statistical monstrosities" because they are so easily
and so frequently misinterpreted. Grade scores for example vary from
test to test. One test may put a pupil's reading performance at 4th grade
while another test will rate the pupil at 5th grade. Sometimes 2 or 3
wrong answers change a grade score by one year. Population samples were
also criticized for often not being representative of the nation.

²Kelley, et al. (1966, p. 29) stated that the 1964 Stanford Achievement
Test, for instance, yields "harder norms" than the 1940 or 1953 editions
of the test. For example, in 1940, 4th grade level corresponded to a test
score of 12. In 1964, 4th grade level corresponded to a test score of 18.

³For a penetrating analysis of how tests differ see Kerfoot (1965).

[Table 2: summary of "qualitative" aspects of three standardized reading
comprehension tests; table body not legible in the source]

This confusion about reading comprehension is summed up by Kolers:

We cannot yet describe accurately even what it is 
we are measuring when we measure "comprehension" 
in reading tests, or what we mean by "understanding," 
and we cannot yet say accurately what it is we mean 
by "meaning". ... (Kolers, 1968, p. xxxiv) 

Despite the lack of knowledge with regard to the teaching, testing 
and nature of reading comprehension, many individuals do read adequately 
to meet the needs of everyday life. Obviously, then, something in the 
educational experience of these individuals has been effective.
Analysis of the skills of effective readers may contribute to an
understanding of reading comprehension generally.¹

¹Smith (1971) presented a model of the reading comprehension
process derived from an analysis of mature reading. The approach
presented here differs in that "comprehension achievement" as
demonstrated by empirically constructed standardized tests at the elementary,
intermediate and advanced grade levels is analyzed.

Despite their limitations, standardized reading comprehension tests
offer a source of empirical data for understanding "reading
comprehension." Norming procedures, including detailed analyses of item
difficulty, produce an empirically valid progression of reading 
performance. An analysis of reading comprehension tests holds promise 
for revealing 

1. the nature of the comprehension task 

2. whether the task differs by grade level

3. whether the task differs by test battery 

4. what determines difficulty of the task. 


Summary 

Because there seems to be no clear or consistent definition of 
reading comprehension, significant differences appear among materials 
designed to teach and test it. A study of comprehension tests will 
reveal what tasks are currently used as the criteria for comprehension. 

The remaining chapters of this dissertation will consist of an 
analysis of the historical development, structure and content of 
selected standardized reading comprehension tests. The analysis
includes investigation of language and performance factors in selections,
questions and choices of tests. 

Chapter II presents the development of standardized reading 
comprehension tests. Chapter III briefly introduces the objectives 
of this dissertation, the reading comprehension tests that were 
studied, and the analyses that were conducted. Chapter IV presents
a comprehensive study of the readability of reading comprehension
tests. Chapter V presents a comprehensive study of the tasks in 
reading comprehension tests. On the basis of these analyses 
a suggestion is made for new tests of comprehension in Chapter VI. 



CHAPTER II 



Development of Standardized Reading Comprehension Tests 



Historical Foundations in Testing 



Edward L. Thorndike initiated standardized testing in reading 
comprehension. In 1914 he published a major article on the topic in 
the Teachers College Record entitled "Measurement of Ability in 
Reading." He began by experimenting with different "degrees" or types 
of understanding: 

. . .What degree of understanding we require in our 
test is of almost no consequence, but that we 
should define objectively the degree of understanding 
that we do require is of very great importance.... 
(Thorndike, 1914, p. 226) 

Toward this end, he devised the "Visual Vocabulary Scale." This set
of tests did "not measure ability to understand the meaning of these
printed words in general, or, as they come in ordinary texts, or
completely, but only to understand them well enough to classify them
as required by the test" (Thorndike, 1914, p. 226):

Look at each word and write the letter F under every
word that means a flower.

Then look at each word again and write the letter A
under every word that means an animal.

Then look at each word again and write the letter N
under every word that means a boy's name.

...

4. camel, samuel, kind, lily, cruel

5. cowardly, dominoes, kangaroo, pansy, tennis

6. during, generous, later, modest, rhinoceros

7. claude, courteous, isaiah, merciful, reasonable

... (Thorndike, 1914, p. 209)

Another set of tests, "Scale for Measuring the Understanding of
Sentences and Paragraphs," was designed by Thorndike to measure pupil
ability to answer questions about a series of sentences. He stated
that:



Mere word knowledge is much less important than the 
ability to get the message carried by a continuous 
passage. Competent judges would rate the latter as 
from sixty to ninety per cent of the total result to 
be sought by the elementary school in the teaching 
of reading. Probably no other one scale for educa- 
tional measurement is so important as a scale for 
measuring the understanding of sentences and paragraphs. 
(Thorndike, 1914, p. 238) 

Actually the Scale for Measuring the Understanding of Sentences and 
Paragraphs was made up of two subgroups. One group of sentences con- 
tained narratives or anecdotes. Students were asked to read the 
sentences and then answer questions: 

In Franklin, attendance upon school is required of 
every child between the ages of seven and fourteen 
on every day when school is in session unless the 
child is so ill as to be unable to go to school, or 
some person in his house is ill with a contagious 
disease, or the roads are impassable. 

1. What is the general topic of the paragraph? 

2. On what day would a ten-year-old girl not be 
expected to attend school? 

3. Between what years is attendance upon school 
compulsory in Franklin? ... (Thorndike, 1914, 
p. 267) 

The other group of sentences contained directions, which were 
quite simple at the lower levels, but complicated by numerous
qualifying conditions at higher levels:

In these two lines draw a line under every 5
that comes just after a 2, unless the 2 comes
just after a 9. If that is the case, draw a
line under the next figure after the 5:

536254174257654925386125 
4735239258479256125748 5 6 

(Thorndike, 1914, p. 247) 
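The qualifying conditions in this item can be made explicit with a short sketch. It is only an illustration of one reading of the directions, not part of Thorndike's materials: the function names are hypothetical, positions are counted from zero, and the second row is assumed to end in the figures 5 and 6.

```python
def figures_to_underline(digits):
    """Return 0-based positions to underline under one reading of the rule."""
    marks = []
    for i, figure in enumerate(digits):
        # Only a 5 that comes just after a 2 triggers the rule.
        if figure != "5" or i == 0 or digits[i - 1] != "2":
            continue
        if i >= 2 and digits[i - 2] == "9":
            # Exception: the 2 follows a 9, so mark the figure after the 5.
            if i + 1 < len(digits):
                marks.append(i + 1)
        else:
            marks.append(i)
    return marks


def show(digits):
    """Render the row with a caret under each figure to be underlined."""
    marks = set(figures_to_underline(digits))
    carets = "".join("^" if i in marks else " " for i in range(len(digits)))
    return digits + "\n" + carets


ROW_1 = "536254174257654925386125"
ROW_2 = "473523925847925612574856"  # assumption: the row ends in 5 and 6

if __name__ == "__main__":
    print(show(ROW_1))
    print(show(ROW_2))
```

Even in this form the pupil must track three conditions at once (the 5, the preceding 2, and the figure before the 2), which is the kind of complication by qualifying conditions the item was built to test.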

These comprehension questions corresponded to Thorndike's
conception of understanding:

Understanding a paragraph is like solving a
problem in mathematics. It consists in selecting
the right elements of the situation and putting
them together in the right relations and also
with the right amount of weight or influence or
force for each. The mind is assailed as it were
by every word in the paragraph. It must select,
repress, soften, emphasize, correlate and organize
all under the influence of the right mental set or
purpose or demand. (Thorndike, 1917b, p. 329)¹

After developing the comprehension tests, Thorndike administered
them to large numbers of children and conducted careful analyses of
errors. From the error analysis, he concluded that mistakes on tests
were due to three causes. The first error resulted from mistakes in
word definition. Pupils attributed either wrong or inadequate meanings
to words in the paragraph or question and developed their answers
around this misinterpretation. For example, in the previous paragraph
about the rules for school attendance in the city of Franklin (p. 18),
some students defined Franklin as a man's name rather than the name of
a city. Other students went even further and confused Franklin with a
particular man in such answers as "a great inventor." As Thorndike
described the errors, "...a word may produce all degrees of erroneous
meaning for a given context, from a slight inadequacy to an extreme
perversion (Thorndike, 1917b, p. 327)."

¹Thorndike's view of the nature of comprehension was really an
outgrowth of his more general connectionistic theory of learning.
See Hilgard (1956), Chapter II.

Thorndike called the second type of error "over-potency." Over- 
potency resulted when pupils chose an element such as a fact in the 
paragraph, a word in the question, or a fact from general experience, 
attributed undue importance to it, and formulated an answer around it. 
For example, in the previous paragraph about Franklin, pupils who 
stated that the topic of the paragraph was "Franklin attends school" 
gave over-potency to the element "Franklin." 

The third type of error — a complement of the second — was called 
"under-potency." Under-potency referred to mistakenly ignoring the 
influence of a word in the paragraph or question. Using the example 
of school days in Franklin again, students were asked, "On what day 
would a ten-year-old girl not be expected to attend school?"
Students demonstrated under-potency of the word "not" in answers such
as "when school is in session" or "five days a week (Thorndike, 1917b, 
p. 328)." 

As a result of his investigations, Thorndike made three observa- 
tions about reading comprehension. First, mental set was very 
influential in the way students understood selections and answered 
questions. Second, reading comprehension difficulty could be due to 
the structure of either the question or selection. Third, a
discrepancy could exist between understanding the words and
understanding the task. For example, even though a pupil might
understand the words
in the selection and the question, he might not understand what he is 
expected to do in order to demonstrate his comprehension. More often 
than not, the way the comprehension tests were organized made it im- 
possible to establish which of these aspects led the student astray: 

One could in fact make a scale ... harder, using just
the same paragraph and using questions simply phrased, 
but demanding the understanding of more and more in- 
tricate, subtle, and technical features of the para- 
graph. Eventually, we may expect to have at least two 
scales, — one of harder and harder paragraphs or ques- 
tions or both, each to be read perfectly, the other of 
a few paragraphs to be read with increasing degrees of 
fullness and exactitude. The present scale is a mixture. 
(Thorndike, 1915, p. 460)¹

As a result of his third observation, Thorndike began to experiment
with both verbal (answering a question in narrative form) and action
(answering a question by following directions) responses in his paper
and pencil tests.² Thorndike stated that "measures of ability to
understand should be unconfused by ability to express oneself orally
or in writing (Thorndike, 1914, p. 227)." He therefore preferred
multiple-choice and short-answer questions to the longer, less
restrictive essay questions.

Thorndike explicitly stated that his tests were not designed to
diagnose skill deficits. A teacher would not know a pupil's specific
strengths or weaknesses in reading from a score on these tests. Nor
would a teacher know what should or should not be taught in reading
comprehension. The tests did not set standards or objectives for
instruction. All a teacher would know from a pupil's score on these
tests was how well the pupil could perform on a certain combination
of reading test items in relation to many other pupils of
corresponding age or grade.

¹Most current tests still represent a mixture. As yet, there are
no valid independent measures of selection and question difficulty; and
thus there are no systemized scales that vary the one or the other
knowledgeably.

²See sample items on pp. 17, 18. The Franklin item represents a
narrative response while the drawing lines item represents an action
response.

The assumptions underlying the construction of Thorndike's tests
were generally that "achievement of paragraph reading is distributed
approximately in the form of the so-called normal probability
surface..." and "...that the variability of any grade from the fourth
to the twelfth is approximately equal to that of any other (Thorndike,
1916, p. 41)." Thus, items were designated for a specific grade level
if they were passed by the major proportion of pupils at that grade
level, by a lesser proportion of pupils at the adjacent lower grade,
and by a greater proportion of pupils at the adjacent higher grade.
However, that did not help teachers establish whether or not pupils
could read and understand textbooks or more general types of reading
matter. The test scores merely reflected a kind of natural phenomenon:

What will be achieved as the science of education progresses 
can not be stated. What should be achieved now if the best 
known methods were used by the best teachers now available,
I will also not try to estimate. What are called "standards"
here are simply achievements a little above those actually 
made in schools under the possibly disturbing conditions of 
test (sic) by an outsider. 

A school whose pupils are able to read as well as this is 
probably doing better than the general run of schools, but 
...it is not achieving enough to enable its pupils to read 
easily the text-books they are studying, to say nothing of 
more difficult discussions in newspapers and magazines. 

(Thorndike, 1915, p. 458) 

Thorndike's awareness that only a relative comparison among pupils 
was possible with his test did not prevent him from suggesting some
objectives for a desirable level of reading achievement; nor did it
prevent him from trying to make the test as pure a measure of
comprehension as possible. However, despite attempts to isolate the
ability to understand paragraphs, he had to reconcile himself to the 
fact that testing would probably be limited to measuring combinations 
of factors: 

The scale even when properly administered will 
occasionally measure a mixture of general stupidity 
or indolence or mischief with an inability to understand
words. Probably no scale could be objective
and convenient in use without suffering from this 
limitation. (Thorndike, 1914, p. 226) 

Although it had limitations, standardized group testing as first
developed by Thorndike permitted the evaluation of classes, teachers,
methods, and schools by more objective criteria than were previously
available. For the first time comparisons among school districts,
socio-economic and ethnic populations became possible. In part, large
testing programs were facilitated by Thorndike's introduction of
scoring keys and record sheets. He made scoring and tabulating so
simple that it could be done very quickly even by non-professionals.

Two more aspects of Thorndike's work should be noted. First, he
measured the time factor. He suggested that students should not be
given speeded tests since unnecessary anxiety might be produced.
However, he thought reading rate was valuable information for the
teacher. Generally, he anticipated that older children would work
faster than younger ones, and that more intelligent children would
work faster than duller ones.

Second, Thorndike identified comprehension with thinking:

Understanding a spoken or printed paragraph is
then a matter of habits, connections, mental
bonds, but these have to be selected from so
many others, and given relative weights so
delicately, and used together in so elaborate
an organization that "to read" means "to think,"
as truly as does "to evaluate" or "to invent" or
"to demonstrate" or "to verify." (Thorndike,
1918, p. 114)

To summarize, Thorndike made significant contributions to the
validity, construction, design, administration and scoring of reading
comprehension tests. Thorndike:

1. strove to specify the kind of comprehension being tested

2. introduced standardization and norming procedures

3. identified discrepancies among selection-question-response
difficulty

4. demonstrated appropriate use of the time factor

5. facilitated simultaneous testing of many students

6. developed quick and economical scoring procedures.



Innovations in Testing 

Since E. L. Thorndike the most dramatic changes in the develop- 
ment of reading comprehension tests have been technological. Currently 
published standardized tests are normed on as many as 260 school
systems in 50 states and on 850,000 pupils (Kelley, et al, 1964a).

Widely used tests include about 15,000 questions in their experi- 
mental forms (Kelley, et al, 1964a). Percentages are computed for the 
number of children choosing each multiple-choice distractor. On the 
basis of these percentages, "item profiles" are constructed: 




These item profiles were considered one of the
most important indices of item validity, and
considerable weight was attached to them in the
selection of items for the final forms. Results
of this item tryout permitted identification of
ambiguous items, of items either too easy or too
difficult for the grades for which they were
intended, and items unsatisfactory in other
respects. Such items were eliminated from
consideration for retention in the final forms....
(Kelley, et al, 1964a, p. 26)
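The item profile described above amounts to a percentage table per item. The following is a minimal sketch of that tabulation, not the publishers' actual procedure: the function names and the cut-off percentages for "too easy" and "too difficult" are assumptions chosen only for illustration, as is the rule that a distractor drawing more choices than the keyed answer suggests ambiguity.

```python
from collections import Counter


def item_profile(responses, options):
    """Percentage of pupils choosing each answer option for one item."""
    if not responses:
        raise ValueError("no responses to profile")
    counts = Counter(responses)
    return {opt: 100.0 * counts[opt] / len(responses) for opt in options}


def flag_item(profile, correct, too_easy=90.0, too_hard=20.0):
    """Flag an item using hypothetical cut-offs (not taken from the source)."""
    flags = []
    pct_correct = profile[correct]
    if pct_correct >= too_easy:
        flags.append("too easy")
    if pct_correct <= too_hard:
        flags.append("too difficult")
    # A distractor chosen more often than the keyed answer may signal ambiguity.
    if any(pct > pct_correct for opt, pct in profile.items() if opt != correct):
        flags.append("possibly ambiguous")
    return flags


if __name__ == "__main__":
    profile = item_profile(list("ABBBBBCD"), "ABCD")
    print(profile)
    print(flag_item(profile, correct="A"))
```

A tabulation of this kind shows which items behave poorly in tryout, but it says nothing about why pupils choose a given distractor.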

In addition to improvements in norming procedures, considerable
refinements have been made in the mechanics of test administration and
scoring. Time restrictions are investigated during the norming
procedure, and the limit selected represents the amount of time
required by a specified percent of the norming population to complete
the prescribed task. Statistical innovations, particularly the
development of stanines, permit score comparison by equal units.
Computers make possible the scoring of answer sheets rapidly and at
minimal cost (Harcourt, Brace and World, 1968; California Test Bureau,
1968).

Another major innovation has been the introduction of cloze
procedure.¹ Cloze procedure is a systematic deletion of every nth
word in a passage of prose. Usually every 5th word is deleted. Pupils
are asked to fill the blanks. Mostly, only an exact replacement of
the deleted word is marked correct (Taylor, 1953; Tremont, 1967;
Bormuth, 1969b).² Taylor, who first introduced the procedure in 1953,
explained:

Cloze procedure derives its name from the "closure"
concept of Gestalt psychology. Just as there is an
apparent human tendency to "see" a not-quite-complete
circle as a whole circle - by "mentally closing the
gap" and making the image conform to a familiar
shape - so does it seem that humans try to complete
a mutilated sentence by filling in those words that
make the finished pattern of language symbols fit
the apparent meaning. (Taylor, 1957, p. 19)

¹A more detailed analysis of cloze as a test of comprehension
is given by Tremont (1967, p. 50-66) and Bormuth (1969b).
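The construction and scoring rules just described (delete every nth word, usually every 5th, and credit only exact replacements) can be sketched in a few lines. This is an illustrative sketch, not Taylor's own procedure: the function names are hypothetical, counting is assumed to start at the first word of the passage, and ignoring capitalization and punctuation when matching answers is an assumption made here.

```python
import string

BLANK = "_____"


def make_cloze(passage, n=5):
    """Delete every nth word; return the mutilated text and the answer key."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    test_words, key = [], []
    for position, word in enumerate(passage.split(), start=1):
        core = word.strip(string.punctuation)  # keep punctuation in the text
        if position % n == 0 and core:
            key.append(core)
            test_words.append(word.replace(core, BLANK, 1))
        else:
            test_words.append(word)
    return " ".join(test_words), key


def score_exact(responses, key):
    """Count responses that exactly replace the deleted word (case ignored)."""
    return sum(r.strip().lower() == k.lower() for r, k in zip(responses, key))


if __name__ == "__main__":
    passage = ("The quick brown fox jumps over the lazy dog "
               "near the quiet river bank today.")
    text, key = make_cloze(passage)
    print(text)
    print(score_exact(["jumps", "by", "Today"], key), "of", len(key))
```

In the usage example the response "by" for the deleted "near" earns no credit even though it fits the sentence, which is the scoring penalty exact replacement imposes.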

Cloze procedure has solved some problems in testing while at the
same time creating new ones. The cloze test is considerably simpler
to construct than a question test. It also eliminates the interference
of question content and structure in tests of comprehension (Bormuth,
1966, 1967; Simons, 1970). Unfortunately, however, it is not always
clear what cloze is measuring.³ Taylor (1957) found cloze scores to be
a dependable index of mental ability, of previous knowledge, and of
information known after reading a given prose passage.⁴ Furthermore,
he established that the parts of speech deleted determined the
difficulty of the tests. Deletions of every nth noun, verb and adverb
created the test of greatest difficulty. Deletions of every nth
adjective or preposition proved to be of intermediate difficulty.
Deletions of every nth auxiliary verb, conjunction, pronoun or article
created the simplest test. Taylor also constructed what he called
"any" tests, by deleting every nth word irrespective of its part of
speech. The "any" type of cloze was easier to construct than the other
forms. It also proved most satisfactory in providing stable results
and discriminating among testees. Furthermore, Taylor (1957)
established that his most difficult test (deletion of every nth noun,
verb or adverb) was the best indicator of previous knowledge.

²This type of scoring penalizes pupils who may put down a perfectly
valid, although not identical, answer. Thus, even though these pupils
in fact demonstrate comprehension, their performance is rated as
inadequate. Cloze shares this problem with all other comprehension
tests that require one answer where many may be equally appropriate.
Thus, accuracy in testing is sacrificed for simplicity in scoring.
Trenaman (1967) discusses in somewhat greater length how people
may validly differ in their understanding of a language communication.

³For example, Tremont (1967, p. 66) suggested cloze "may be an
excellent test for measuring the interrelationships among ideas;" and
thus better than most traditional tests, which he concluded, "measure
word meanings, literal meanings of sentences, and only occasionally
consider measuring relationships among ideas." Bormuth (1969b, p. 365)
concluded that cloze tests "measure skills closely related or identical
to those measured by conventional multiple-choice reading comprehension
tests." And, Simons (1970, p. 14) concluded that cloze is a "better
measure of comprehension because... it appears to be measuring fewer
extraneous aspects of cognitive functioning than traditional tests do."

Generally, deciding what best measures comprehension seems very
much dependent on how reading comprehension is defined. The present
lack of information with respect to cognitive functioning in the case
of reading makes it very difficult to establish which cognitive
functions are or are not extraneous to reading.

⁴Taylor (1957) used a technical report on Air Force supply systems
as his reading selection. The subjects of his study were 152 Air Force
trainees. However, his findings about the difficulty of tests
constructed by deleting the various parts of speech seem to have more
general significance. It seems to the present writer that the relative
difficulty of these types of tests may remain constant both for different
types of prose passages and for different groups of readers.

Current widely-used comprehension subtests are not designed by
regularly deleting every nth word. Rather, 1, 2 or 3 words in a sentence
or paragraph are deleted with no rationale or explanation given by the
test authors. An informal review by the present writer of cloze-like
blanks in selected tests of reading comprehension revealed that deleted
words were generally nouns, verbs, or adverbs. Thus, if Taylor's (1957)
findings may be applied here, these tests would be testing previous
knowledge.

Another innovation in testing has been the almost universal
acceptance of multiple-choice. Rather than writing their own answer, 
pupils are asked to respond with one out of four or five answer 
choices. Generally one choice is the correct answer and the other 
choices act as distractors. The benefits of this innovation in testing 
are quick and objective scoring. A disadvantage of this innovation is 
that the pupil's answer to a multiple-choice item "may be influenced 
by the distractors from which he has to choose just as much as by 
the question part of the item (Schlesinger and Weiser, 1970, p. 569)." 
Bormuth (1966, p. 82) also contended that "it is notoriously easy to 
vary the difficulties of these tests simply by changing the alternatives 
to the question." Since little is known about distractor combinations,
little is known about what a pupil must do to choose the correct
answer. Guttman and Schlesinger (1967) have begun studying the types
of errors pupils make in choosing distractors. They have concluded that 
there are consistencies in types of errors pupils make, and that iden- 
tification of these consistencies may prove diagnostically useful. 

After studying current developments in educational testing,
P. E. Vernon concluded that:

Whatever the subject matter - English, social studies 
or natural sciences - they tend to take the form of 
complex reading comprehension tests, and they there- 
fore appear to depend partly on the students' facility 
in understanding the instructions and coping with 
multiple-choice items. (Vernon, 1962, p. 269) 

Vernon supported his conclusion by pointing out that the correlations
between tests aimed at different mental functions or different
school subjects were extremely high.⁵ He further hypothesized that the
differences among tests may "be due merely to the imperfect reliability
of the contrasted tests (Vernon, 1962, p. 270)."
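One conventional way to examine a hypothesis of this kind is Spearman's correction for attenuation, which estimates the correlation two tests would show if both were perfectly reliable. The sketch below applies it to the mean figures Vernon quotes (an intercorrelation of .716 and a reliability of .905). Neither Vernon nor the present study reports this calculation, so it is offered only as an illustration; applying the formula to averaged values, and using the same mean reliability for both tests, are simplifying assumptions.

```python
import math


def disattenuated_r(r_xy, rel_x, rel_y):
    """Spearman's correction: r_xy divided by sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Mean values quoted from Vernon (1962, p. 270); using the mean reliability
# for both tests in a pair is an assumption made for this sketch.
MEAN_INTERCORRELATION = 0.716
MEAN_RELIABILITY = 0.905

estimate = disattenuated_r(MEAN_INTERCORRELATION,
                           MEAN_RELIABILITY, MEAN_RELIABILITY)

if __name__ == "__main__":
    print(f"estimated correlation between true scores: {estimate:.3f}")
```

Under these assumptions the estimate comes to roughly .79. The closer such an estimate lies to 1.0, the more of the observed difference among tests could be attributed to unreliability alone.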

Investigations Into the Nature of Reading Comprehension 


The review of the literature thus far has focused on the development
of reading comprehension tests. However, this development was
not isolated. Rather, it was related to the studies of scholars from
many disciplines who investigated reading comprehension for different
purposes and by different approaches.

The three most prevalent methodologies used to study comprehension
seemed to evolve in a historical sequence. The first type of
study was philosophical investigation based on intuition and logical
analysis. Topics of concern included the goals of reading
comprehension, means to attain these goals, relationship of thought to
language, and relationship of reading matter to understanding. The
treatise of Locke (1697) on the Conduct of the Understanding, the
writing of Stewart (1811) in Philosophical Essays, and the analysis of
Smart (1855) in Thought and Language are examples of early
philosophical works concerned with reading comprehension.

⁵Vernon quoted the mean intercorrelation for five of the Iowa
Tests of Educational Development (basic social concepts, reading in
social studies, reading in natural science, interpreting literary
materials and vocabulary) among 9th and 12th grade students of .716
while the tests had a mean reliability of .905 (Vernon, 1962, p. 270).



Experimental psychology provided a second approach to the study 
of reading comprehension. Philosophical investigations, such as the 
work of Richards (1929), and Wittgenstein (1958) continued, but the 
newer experimental techniques seemed more prevalent. 

The experimental studies were characterized by the testing of
given hypotheses about reading comprehension. Criteria had to be
specified that would either substantiate or refute the hypotheses.
Thus, an increased interest developed in specifying the desired goals
of comprehension, as well as criteria for determining the occurrence
of comprehension. Ingenious machines to record the rhythm and sequence
of visual movements were developed. Readers' introspections were
noted and analyzed. Investigations of empirical phenomena (observing
behaviors of large groups of readers) were also conducted. The works
of Huey (1908, reissued 1968), Thorndike (1914-18), James (1928) and
Skinner (1937) exemplify the experimental period. Generally, the
experimental approach resulted in testing (substantiating or refuting)
some of the speculations proposed earlier by philosophers. The
experimental approach also created the need for better objective
evaluation of reading data. Statistics thus provided the third
approach to studying reading comprehension. As before, previous
approaches were not rejected. Philosophical and experimental studies
continued. Occasionally studies of comprehension incorporated all
three approaches.¹ However, statistical studies became most prevalent.



¹Levin and Williams (1970) present an interesting combination of
recent studies of reading. Although the studies incorporated in their
book are largely of an experimental nature, both philosophical and
statistical approaches are represented.

The statistical analyses were generally of two types. One type
was process-oriented. Studies of this type analyzed and interpreted
factors and correlations that reflected the comprehension process
itself (Gans, 1940; Langsam, 1941; Davis, 1944; Hall and Robinson, 1945;
Thurston, 1946; Anderson, 1949; Johnson, 1949; Sochor, 1959; Alshan,
1964; Holmes and Singer, 1966; Davis, 1968; Trenaman, 1967). The
studies were not always based on widely-used standardized tests of
reading comprehension. For example, Davis in 1944 developed his own
questions by making "...a careful survey... of the literature to
identify the comprehension skills deemed most important (Davis, 1968,
p. 504)."

Conclusions drawn from process-oriented studies have produced
inconsistent results. Some investigators concluded that reading
comprehension had numerous factors while others concluded that
comprehension was one general factor.¹ As might be expected, the
resultant factor or factors reflected the structure and content of
the tests as well as the statistical treatment.² Sometimes, the
resultant factors were almost identical to the criteria used for
devising test items. Often, in factor analytic studies, the criteria
as well as the outcome factors suffered from confusion of requirements
for reading, prerequisites for reading, procedures for teaching
reading and skills or abilities used in reading (Strang, 1965;
Robinson, 1966). Despite the disagreements among studies a few
consistent findings did appear. Most of the studies that identified a
number of comprehension factors seemed to agree on four: vocabulary
factor, interrelationship among ideas (represented by words in
context) factor, abstract reasoning factor, and specific content field
factors (Jenkinson, 1968; Simons, 1970).

¹For a lengthier discussion of the controversy among the factor
analytic studies in reading comprehension, see Hunt (1957).

²Davis (1944, p. 185) stated, "Unless these tests provide
reasonably valid measures of the most important mental skills that have
to be performed during the process of reading, the application of the
most rigorous statistical procedures can not yield meaningful and
significant results. The importance of this point can hardly be
overstated."

The second type of statistical analysis investigated the
relationship of factors outside the comprehension process itself to
comprehension as measured by standardized achievement tests. Among the
variables studied were age, sex, race, socio-economic status,
personality traits and intelligence (Bleismer, 1954; Gates, 1961;
Vehar, 1962; Cooper, 1964; Chandler, 1966; Coleman, 1966; Harootumian,
1966; Neville, Pfost and Dobbs, 1967; Dykstra, 1968). Because of the
differences in the reading tests and in the size and composition of
the samples used, few valid generalizations could be drawn from all
these correlational studies. Two generalizations that seemed
consistent, however, were the positive correlations of tested reading
comprehension with tested general intelligence and with socio-economic
factors. Both of these correlations increased with increasing age.

Roger Farr (1969) in his comprehensive study, Reading: What Can
Be Measured?, reviewed and synthesized major contemporary research in
reading comprehension. He discussed the continued controversies in
measuring comprehension: emphasizing reading rate vs. comprehension
power, permitting reference to the reading selection vs. removing the
selection, controlling for previous knowledge vs. ignoring it,
establishing purposes for reading vs. not doing so, testing solely
for syntax vs. testing for many "skills," varying lengths of reading 
selections vs. keeping constant lengths, and controlling for 
personality traits vs. ignoring them. Farr (1969, p. 56) concluded 
that "there is still a lack of understanding about the basic aspects 
of reading comprehension." 

The review of the philosophical, experimental and statistical 
studies of reading comprehension led the present writer to conclude 
that there is considerable interdependence among them. The results 
of "objective" experimental and statistical analyses are generally 
colored by intuitive and subjective criteria. In most experimental 
and statistical studies, tests were used as the criteria of reading
comprehension. Validity of the tests was judged by the researcher
and test-author.¹ Often the researchers' and test-authors' conception
of comprehension resulted from one or a combination of earlier philo- 
sophical positions. Tests and experimental designs tended to reflect 
at least in part, one or another philosophical orientation. Subse- 
quently, differences among conclusions resulting from statistical 
analyses and "experiments" reflected to some extent the corresponding 
differences in philosophies, and thus were not totally "objective." 

Understandably, Farr suggested that "The most pressing research
need in measuring comprehension is for a clear understanding of the
nature of reading comprehension (Farr, 1969, p. 64)." Finding where
to begin such an undertaking is no easy task. Miller (1967, p. 90)
pointed out that "No psychological process is more important or
difficult to understand than understanding, and nowhere has scientific
psychology proved more disappointing to those who have turned to it
for help." Therefore, in the present writer's opinion, it may be most
practical to start with an analysis of empirically constructed
(standardized and normed) comprehension tests. These tests are the
accepted criteria of reading comprehension. There is, however, no clear
understanding yet of what these tests are measuring. An understanding
of widely-used reading comprehension tests may lead to a better
understanding of what is currently being called reading comprehension.

¹Guilford (1946, p. 437) points out that "Even sophisticated
judgment often goes astray on decisions as to what a test measures.
A test designed to measure common sense judgment when factor analyzed
turns out to be a test of mechanical experience. A test designed as
a reasoning test is found to be one of numerical facility, when analyzed....
The moral of it is that in test construction..., things are not always
what they seem."

CHAPTER III



Analysis of Standardized Reading Comprehension Tests 

Introduction 

Standardized reading comprehension tests have been, for many
years, the accepted method for evaluating reading comprehension.
Because of this, these tests offer a source of empirical data which,
if analyzed properly, may improve the understanding of reading
comprehension.

The elaborate norming procedures and item analyses that characterize
standardized tests establish an empirical scale of comprehension
difficulty.¹ Test-authors select items for a given grade level only
if the items successfully differentiate between the high and low
achievers at that level and also between that level and adjacent
levels. To date, no one seems to know why or how these items
differentiate. However, the fact that the test items do empirically
discriminate among pupils and grades implies strongly that
comprehension items reflect an underlying structure or sequence of
reading comprehension tasks. A systematic analysis of reading
comprehension test items, therefore, may reveal this structure.

¹See Chapter II, p. 17 of this dissertation for a more detailed
account of the procedure used in developing tests.
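The two empirical screens described in this paragraph can be written out as a short sketch. It is an illustration under stated assumptions, not any publisher's documented procedure: the function names are hypothetical, "differentiating between levels" is taken to mean a pass rate that rises from the lower to the higher grade, and the requirement that more than half of the target-grade pupils pass is an assumed threshold.

```python
def suits_grade(p_lower, p_target, p_higher, majority=0.5):
    """True if pass rates rise across adjacent grades and most target pupils pass.

    Each argument is the proportion of pupils passing the item at that grade;
    the majority threshold is an assumption, not a published criterion.
    """
    return p_lower < p_target < p_higher and p_target > majority


def discrimination_index(high_group, low_group):
    """Pass rate of high achievers minus pass rate of low achievers.

    Each group is a list of item scores, 1 for a pass and 0 for a failure.
    """
    if not high_group or not low_group:
        raise ValueError("both groups need at least one pupil")
    return sum(high_group) / len(high_group) - sum(low_group) / len(low_group)


if __name__ == "__main__":
    print(suits_grade(0.35, 0.60, 0.80))
    print(discrimination_index([1, 1, 1, 0], [1, 0, 0, 0]))
```

Both checks operate only on pass rates. They identify which items separate pupils and grades without indicating what feature of an item produces the separation, which is the question taken up in this chapter.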



Objectives 



The objectives of this systematic analysis of reading
comprehension tests were to:

1. characterize the nature of reading comprehension as tested
at three grade levels (grades 1-2, 4-6, and 9-14); e.g.,
Are there differences in what pupils have to do or know in
order to demonstrate reading comprehension on tests at the
different grade levels? What kinds of changes occur from
one grade level to the next?

2. characterize the nature of reading comprehension as tested
by different test batteries; e.g., Are there differences
in what pupils have to do or know in order to demonstrate
a given level of reading comprehension in different test
batteries? What kinds of differences exist?

3. identify factors that may contribute to difficulty in tested
comprehension; e.g., What are the factors that make one test
question more difficult than another; or one test more
difficult than another?

4. characterize the nature of tested reading comprehension;
e.g., What do comprehension tests test? What does a pupil
have to do in order to demonstrate reading comprehension on
these tests? What does a pupil have to know in order to
demonstrate reading comprehension on these tests?



Reading comprehension subtests selected for this study were
the California Achievement Test (1963), form W, comprehension/
interpretation subtest (CAT), Gates-MacGinitie Reading Test (1965,
1969), form 1, comprehension subtest (GMRT), and Stanford
Achievement Test (1964, 1965), form X, paragraph meaning/reading
subtest (SAT).

The subtests chosen were designed by their authors to
measure understanding and comprehension (see Table 2).

The names of the comprehension subtests varied among the
batteries. The CAT subtest at the lower primary level was called
comprehension. However, at the higher levels the comprehension
subtest on the CAT was broken down into three parts: interpretation,
following directions and reference skills. The "interpretation"
part was analyzed in this study since it most closely resembled the
comprehension subtests of the other batteries. The "following
directions" part of the CAT comprehension subtest was made up of
short passages giving math, history or science information, a
direction requiring the identification or application of the given
information and four answer choices. The "reference skills" part
included questions usually requiring knowledge of reference materials
(e.g. dictionaries, maps, graphs) and four answer choices.

The SAT comprehension subtest was called paragraph meaning at the
lower levels; however, on the highest level (high school) the subtest
name was "reading." In both the SAT and GMRT the highest level tests
were published later than the rest of the battery. Thus, while most
of the SAT battery was published in 1964, the High School test was not
published until 1965. Similarly with the GMRT, most of the battery
was published in 1965, but the highest level test, Survey F, was not
published until 1969.

The test batteries (CAT, GMRT, SAT) were selected on the basis
of five criteria:

1. that the reading comprehension subtests be comparable;*

2. that the tests be standardized;

3. that the tests be normed at relatively corresponding grade levels;

4. that the tests be widely used in the United States;

5. that the tests be distributed by different publishers.

Three grade levels were chosen within each test battery in
order to observe the progression of reading comprehension difficulty.
The lowest level tests were for grades 1 and 2, the intermediate level
for grades 4 through 6, and the highest level tests were for grades 9
through 14.²



Although the subtests were generally comparable in format, a
number of differences existed. These differences were most evident
at the lowest grade level. The lowest level CAT and GMRT contained
a number of "direction" items. In these items answer choices followed
the "direction", but there was no reading selection, as there was with
most items on the lowest level SAT subtest and on all higher level
subtests. In addition, the first grade level GMRT presented picture
answer choices while all other subtests analyzed had word answer
choices. Also, the first grade level CAT included one open-ended (no
choices) item, one "direction" that consisted of copying the initial
letter of a word, and two items that were mutilated words that had to
be fixed.

2 

The grade levels for which the subtests from different batteries
were intended by their authors were not entirely consistent.
Specifically, at the lowest grade level, the GMRT (Gates and
MacGinitie, 1965a, p. 1) was intended for first grade only, while the
CAT (Tiegs and Clark, 1963c, p. 1) and the SAT (Kelley, et al, 1964c,
p. 1) were intended for grades 1 and 2. At the intermediate level the
SAT (Kelley, et al, 1964b, p. 2) authors wrote two tests: one for
grade 4 and one for grades 5 and 6. The CAT (Tiegs and Clark, 1963b,
p. 1) and the GMRT (Gates and MacGinitie, 1965b, p. 1) authors
published only one test for grades 4 through 6. At the highest level,
the CAT (Tiegs and Clark, 1963a, p. 1) was intended by its authors for
grades 9 through 14, while the GMRT (Gates and MacGinitie, 1970, p. 1)
was intended for grades 10 to 12, and the SAT (Gardner, et al, 1965,
p. 1) was intended for grades 9 to 12.



The tests analyzed generally had similar formats. Most compre- 
hension items consisted of a reading selection, a question, and 
choices. The reading selection usually consisted of a sentence, a 
paragraph, or a number of paragraphs. The selection either contained 
a number of cloze-like blanks, or was followed by one or more 
separate questions. Four or five answer choices were generally pro- 
vided by the test-author. Pupils were required to choose the "choice" 
which correctly filled the "blank" or answered the question. 

Table 3 summarizes the number of selections, questions, and
choices analyzed in each subtest and at each level.¹ A total of
165 selections, 455 questions and 1902 choices were analyzed in all.



Analyses



Only a short introduction to the types of analyses conducted in
this study will be presented here. Chapters IV and V give more specific
descriptions of the analysis procedures.

Reading selections, questions, and choices were analyzed in two
ways. First, a Dale-Chall (1948) and Spache (1953) readability analysis
was made of each selection, question and choice. These and similar
readability formulae are used to appraise objectively the relative
difficulty of basal readers, textbooks, encyclopedias, newspapers, and
standardized tests (Chall, 1956, p. 89). The predictions of reading
difficulty of the Dale-Chall (1948) and the Spache (1953) readability
formulae are based on counts of the number of difficult words in a
reading selection and also the average number of words in the



¹Where the reading selection contained cloze-like blanks, the
sentence containing the blank was counted and later analyzed as the
question.






The Number of Selections, Questions and Choices
Analyzed in this Study by Test and Level

sentences of the selection (Chall, 1958b; Klare, 1965).¹

The second type of analysis was designed especially for this study.
Three judges rated each selection, question and choice. Selections were
rated according to topics or general subject areas (e.g. history, science,
etc.). Questions were rated according to the relationship of the
correct answer choice to the word(s) presenting the information in the
selection (e.g. same word in a different context, grammatically
different, etc.). Wrong choices, called distractors, were rated according
to their relationship to the selection (e.g., in the selection or not),
to the question (e.g., grammatical answer or not), and to the correct
choice (e.g., coordinate, superordinate, etc.).

The data obtained from the two analyses described above were
studied for clear and/or consistent trends rather than for specific
statistically significant differences. Since there were often wide
discrepancies in the number of cases to be compared and since a great
number of statistical relationships would have been explored, the
assumptions underlying most statistical tests of significance were or
would have been violated.²



¹Difficult words in these formulae are identified by their absence
on given lists of easy words. The Dale-Chall formula (1948) uses the
Dale List of 3000 Familiar Words and the Spache (1953) formula uses
Clarence H. Stone's Revision of the Dale List of 769 Easy Words.

2 

The problem of carrying out many non-independent tests of
significance is discussed by Kendall and Stuart (1966, v. 3, p. 40).
For example, the probability of finding a significant difference where
none exists is approximately equal to the product of the level of
significance (α) and the number of tests of significance computed (K),
or Probability ≈ α × K. This formula is for independent sets of data.
However, it provides an approximation for interrelated data such as
are present in this study. Consequently, even if a significance level
of .01 is used in each test of significance, the probability of finding
a significant difference where none exists in 25 tests of significance
would be approximately .01 × 25 = .25, or 1 in 4, rather than the
anticipated 1 in 100.
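
The arithmetic in this footnote can be checked with a short sketch. The function names below are illustrative and do not come from the study. The second function gives the exact rate for fully independent tests, 1 - (1 - α)^K, which is a standard result added here only for comparison; the footnote itself uses the simpler product α × K.

```python
def approx_false_positive_rate(alpha: float, k: int) -> float:
    """Approximation used in the footnote: alpha times K (capped at 1)."""
    return min(1.0, alpha * k)


def exact_false_positive_rate(alpha: float, k: int) -> float:
    """Exact rate for K independent tests: 1 - (1 - alpha)^K.

    This comparison is an addition; the footnote does not state it.
    """
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    # Compare the two rates for the footnote's example (alpha = .01, K = 25).
    for k in (1, 5, 25):
        approx = approx_false_positive_rate(0.01, k)
        exact = exact_false_positive_rate(0.01, k)
        print(f"K={k:2d}  approx={approx:.4f}  exact={exact:.4f}")
```

For small α the product slightly overstates the exact rate for independent tests, so it serves as a conservative rough guide.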



CHAPTER IV 



Readability Measurement 



Introduction 



As Lorge (1949, p. 86) defined it: "The concept of readability
involves the idea of understanding printed material." Attempts at
measuring readability have been traced back hundreds of years (Klare,
1963).

More recently, Chall (1958b, pp. 156-158) formulated seven major
generalizations about readability measures from the "fundamental
methodological research in readability":

1. a variety of factors contribute to reading difficulty... 
content, stylistic elements, format, and organization.... 

2. ...only stylistic elements have been amenable to 
reliable quantitative measurement and verification. 

3. of the diverse stylistic elements...only four types
can be distinguished: vocabulary load, sentence
structure, idea density, and human interest.

4. of the four types of stylistic elements, vocabulary
load (diversity and difficulty) is most significantly
related to all criteria of difficulty so far used.
Vocabulary difficulty has to do with the reader's
understanding of the individual words.... Vocabulary
difficulty has been measured either by reference to
a word list or by word length....

5. almost every study found a significant relationship
between sentence structure and comprehension difficulty.
The most popular method of estimating sentence structure
is by sentence length....

6. readability formulas measure idea density only indirectly
through the percentage of prepositional phrases and, less
often, through the percentage of different content words....
Prepositional phrases have less potent influence on
difficulty than either vocabulary difficulty or sentence
structure. They add little to the over-all prediction of
difficulty, once some measure of these two factors is
included in a formula.

7. human interest has been measured by number of personal 
pronouns, persons' names, and nouns denoting gender.... 
However, these measures add little to a readability formula, 
once vocabulary difficulty and sentence structure are used. 

Chall (1958b) and Klare (1963) described the many readability
formulae that had been developed through different combinations and
weightings of the stylistic elements described above. Klare (1963)
identified 31 formulae by 1960. Currently, new advances in linguistic
theory, particularly the work of Noam Chomsky (1965), have prompted new
developments in measuring readability.¹

The present study explored the concentration of the most potent 
predictors of relative "language" difficulty (described above) in 
selections, questions and choices. Toward this end, the Dale-Chall
(1948) and the Spache (1953) formulae were used. Two formulae were 
necessary since each formula had grade level limitations as do most 
formulae. The Spache formula was designed to rank reading matter from 
the grade 1 through the grade 3 level. The Dale-Chall formula was 
designed to rank reading matter from the grade 4 through the college 
graduate level. Thus the Spache is a more appropriate measure for the 
lower level tests; while the Dale-Chall is a more appropriate measure 



¹Bormuth (1967) discusses the areas of advancement in readability
research: the use of cloze, developments in linguistic theory, and
the use of finer statistics.

for the higher level tests.

Both formulae consist of a measure of vocabulary load (number of 
difficult words not on a given list) and a measure of sentence structure 
(average sentence length). Therefore, they seem more comparable to each 
other than to readability formulae consisting of other measures such as 
word length, or a different kind of readability element such as human 
interest. However, some differences do exist between the two formulae, 
which may account for their appropriateness to different grade levels. 
The Dale-Chall formula is based on the Dale List of 3000 Familiar Words,
while the Spache formula is based on Clarence H. Stone's Revision of the
Dale List of 769 Easy Words. Furthermore, while the Dale-Chall formula
weighs the number of times a difficult word occurs, the Spache formula
counts difficult words only once.
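
The difference between counting every occurrence of a difficult word and counting each difficult word only once can be illustrated with a minimal sketch. The tokenizer, the tiny easy-word set and the sample sentence are all hypothetical stand-ins; the actual word lists are far longer, and the published formulae carry additional rules (for names, inflected forms and the like) that are not modelled here.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def hard_word_counts(text, easy_words):
    """Count words absent from the easy-word list in two ways."""
    tokens = tokenize(text)
    hard = [t for t in tokens if t not in easy_words]
    return {
        "total_words": len(tokens),
        # every occurrence counts (the Dale-Chall style described above)
        "every_occurrence": len(hard),
        # each difficult word counts once (the Spache style described above)
        "unique_only": len(set(hard)),
    }


# Hypothetical miniature "easy word" list, for illustration only.
EASY_WORDS = {"the", "cat", "sat", "on", "a", "and", "saw"}
SAMPLE = "The cat sat on a veranda and saw the veranda cat."

if __name__ == "__main__":
    print(hard_word_counts(SAMPLE, EASY_WORDS))
```

In the sample sentence the word "veranda" appears twice, so the two counting rules give different totals for the same text.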

Another reason for using the Spache and the Dale-Chall formulae 
in the present study was that they appeared to be frequently used in 
appraising educational materials. A demonstration of their popularity 
was the frequent use of these formulae in appraising the difficulty of 
randomly selected comprehension skill builders reviewed in Chapter I 
(see Table 1, p. 5). Finally, the Dale-Chall formula seemed to be among 
both the formulae that correlated most highly with other readability
formulae, and the formulae that gave the most valid grade scores for 
juvenile fiction of intermediate difficulty (Chall, 1958b, p. 164). 

The readability analysis also provided a means of comparing rela- 
tive difficulty of test items, as rated by readability formulae, and the 






empirical difficulty scores of the items provided by the test pub- 
lishers. Empirical difficulty refers to the percentage of pupils 
answering a given test item correctly. Test items generally included 
a reading selection, a question about that selection and answer choices 
to the question. 

An exploration of the distribution of selection, question and
choice readability scores follows. Comparisons were made among reading
comprehension tests at three grade levels (1-2, 4-6, and 9-14) and in
three test batteries (California Achievement Test, 1963; Gates-MacGinitie
Reading Test, 1965, 1969; and Stanford Achievement Test, 1964, 1965).

Also reviewed were the relationships of these readability scores to
each other and to the empirical item difficulty scores.

Procedure 

Readability Scores 

Two Harvard doctoral students independently counted the following 
variables for each: 

1. reading selection 

a. the number of words in the selection 

b. the number of sentences in the selection 

c. the number of words not on Clarence H. Stone's Revision of
the Dale List of 769 Easy Words (non-Spache)

d. the number of words not on the Dale List of 3000 Familiar 
Words (non-Dale-Chall) 

2. question

a. the number of words in the question

b. the number of words not on Clarence H. Stone's Revision
of the Dale List of 769 Easy Words (non-Spache)

c. the number of words not on the Dale List of 3000 Familiar
Words (non-Dale-Chall)

3. choice

a. the number of words in the choice

b. the number of words not on Clarence H. Stone's Revision of
the Dale List of 769 Easy Words (non-Spache)

c. the number of words not on the Dale List of 3000 Familiar
Words (non-Dale-Chall)

When both investigators finished all the items in a given test,
they compared their counts. If counts for any selection, question or 
choice conflicted, both investigators again counted that part of the 
item independently. Results were compared again. This procedure was 
repeated until agreement was reached. 

The "word counts" provided the data which were punched onto IBM
cards. Further computation was conducted on the IBM 360/65.¹ Randomly
selected computations were checked with the Olivetti Programma. The
following scores were computed for each:

1. Reading selection

a. average sentence length in the selection

b. Spache ratio - number of non-Spache words in the selection 
divided by the total number of words in that selection 

c. Dale-Chall ratio - number of non-Dale-Chall words in the
selection divided by the total number of words in that 
selection. 

¹All the scores below were computed according to the specifications
of the Dale-Chall (1948) or the Spache (1953) formula.



d. Dale-Chall raw score - Dale-Chall ratio (c above)
multiplied by the constant 0.1579 and added to average
sentence length multiplied by the constant 0.0496.
A constant, 3.6365, was added to this sum.¹

e. Spache grade score - average sentence length multiplied
by the constant 0.141 and added to the Spache ratio
(b above) multiplied by the constant 0.086. A constant,
0.839, was added to this sum.

f. Dale-Chall grade score - Dale-Chall raw score (d above)
was converted into corrected grade levels from a table.²

2. question

a. Spache ratio - number of non-Spache words in the question
divided by the total number of words in that question

b. Dale-Chall ratio - number of non-Dale-Chall words in the
question divided by the total number of words in that
question



The constants for both the Dale-Chall raw score and the Spache 
grade equivalent resulted from a multiple-regression technique. For 
further information about the regression technique see Chall (1958b) 
and Klare (1963).

2 

The table of corrected grade levels provided by Dale and Chall 
(1948) consists of ranges. For example, a raw score of 5.0 to 5.9 
corresponds to a grade level range from 5th to 6th grade. For purposes 
of simplifying the computation, the midpoint of this range was used in 
computing means and standard deviations for this study. Thus, for 
the above range a grade equivalent of 5.5 was assigned to a raw score
between 5.0 and 5.9.

3. choice

a. Spache ratio - number of non-Spache words in the choice
divided by the total number of words in that choice

b. Dale-Chall ratio - number of non-Dale-Chall words in
the choice divided by the total number of words in that
choice
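
The selection scores listed above can be sketched in a few lines. The constants are those given in the text. Two points are assumptions rather than statements from this study: first, the hard-word ratio is expressed as a percentage, which is how the published Dale-Chall (1948) and Spache (1953) formulae are usually stated; second, apart from the 5.0 to 5.9 row quoted in the footnote, the grade ranges below follow the Dale-Chall correction table as it is commonly reproduced and should be verified against the original. Function names are illustrative.

```python
def average_sentence_length(n_words, n_sentences):
    """Words per sentence in a reading selection."""
    return n_words / n_sentences


def hard_word_percentage(n_hard_words, n_words):
    """Hard-word ratio expressed as a percentage (assumption, see lead-in)."""
    return 100.0 * n_hard_words / n_words


def dale_chall_raw_score(pct_non_dale_chall, avg_sentence_len):
    """Raw score: 0.1579 * ratio + 0.0496 * sentence length + 3.6365."""
    return 0.1579 * pct_non_dale_chall + 0.0496 * avg_sentence_len + 3.6365


def spache_grade_score(avg_sentence_len, pct_non_spache):
    """Grade score: 0.141 * sentence length + 0.086 * ratio + 0.839."""
    return 0.141 * avg_sentence_len + 0.086 * pct_non_spache + 0.839


# (raw low, raw high exclusive, grade low, grade high). Only the first row
# is confirmed by the footnote; the rest are assumed from the published table.
DALE_CHALL_RANGES = [
    (5.0, 6.0, 5, 6),
    (6.0, 7.0, 7, 8),
    (7.0, 8.0, 9, 10),
    (8.0, 9.0, 11, 12),
    (9.0, 10.0, 13, 15),
]


def dale_chall_grade_midpoint(raw_score):
    """Midpoint of the corrected grade range, or None for open-ended ranges."""
    for low, high, grade_low, grade_high in DALE_CHALL_RANGES:
        if low <= raw_score < high:
            return (grade_low + grade_high) / 2
    return None


if __name__ == "__main__":
    # Hypothetical 100-word selection: 10 sentences, 10 hard words on each list.
    asl = average_sentence_length(100, 10)
    pct = hard_word_percentage(10, 100)
    raw = dale_chall_raw_score(pct, asl)
    print(raw, dale_chall_grade_midpoint(raw), spache_grade_score(asl, pct))
```

The open-ended ranges at the bottom and top of the table have no natural midpoint, so the sketch returns None for them rather than guessing how the study treated such scores.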

The grade scores and average sentence lengths were not computed
for the questions and choices because they seemed inappropriate for
two reasons. First, the test questions in this study were usually no
longer than one sentence and the choices were often only one word,
providing essentially no data from which to compute average sentence
length. Second, the readability formulae generally were standardized
on reading selections of approximately 100 words in length (Chall,
1958b, p. 171). Consequently, conclusions based on average sentence
length where none existed, and on grade scores computed with multiple
regression coefficients and constants obtained from 100 word samples
would have been either extremely tentative or possibly even
meaningless.¹ Furthermore, although formulae have been generally accepted

¹Coefficients and constants of readability formulae result from
the particular data sample used in the multiple regression analysis. 
Another data sample would produce different coefficients (Kendall and 
Stuart, 1966, v. 2, p. 355). However, formulae may be used for similar 
samples. There seem to be two types of similarity. One is the con- 
tent of the reading selection. The content of materials appraised by 
readability formulae should be similar to the content of the materials 
on which the formulae were standardized. Both the Dale-Chall and the 
Spache formulae were standardized, in part, on general type of school 
reading matter (Chall, 1958b, p. 39). The second type of similarity is 
the length of the selection. The length of materials appraised by 
readability formulae should correspond to the length of the materials 
on which the formulae were standardized - usually 100 words. Chall 
(1958b, p. 171) concluded that relative difficulty and especially grade 
placement determined for much shorter reading selections "should be
considered tentative."

as valid estimates of relative difficulty of reading matter, their
determination of exact grade-level difficulties has been questioned
for a long time (Chall, 1956, p. 99). However, the more basic
readability factors such as number of words, or number of non-Spache
and non-Dale-Chall words, etc. may appraise the relative difficulty
of questions and choices adequately.

Difficulty Scores 

Difficulty scores (percent right answers to each test question)
of the standardization population were requested by mail from the
publishers of the tests analyzed.¹ Scores of the standardization
population were requested because it seemed that these difficulty 
scores were a by-product of the standardization and norming procedure. 
Consequently the difficulty scores for the standardization population 
might have been most readily available. In addition, since most test 
authors presented descriptions of the standardization population 
in published technical reports, the need for gathering such descrip- 
tive information could have been eliminated. Furthermore, 
standardization populations usually consisted of carefully stratified 
national samples that were expected to represent the national school 
population reasonably well (California Test Bureau, 1957, p. 12; 

¹Standardization and norming were conducted simultaneously.
Possibly for that reason the authors seemed to use the terms
interchangeably (California Test Bureau, 1957; Gates and MacGinitie,
1965d; Kelley, et al, 1966).

Kelley, et al, 1966, p. 9).¹

After many months, much correspondence and many telephone
conversations, it appeared that the difficulty scores for
standardization populations were not readily available. Only the SAT
published difficulty scores of their standardization population for
the 1-2 and 4-6 grade level tests in the Technical Supplement (Kelley,
et al, 1966, p. 46-53). The SAT published difficulty scores for a
national pre-standardization try-out population at the high school
level (Gardner, et al, 1965, p. 16). The GMRT difficulty scores for all
levels were from a "nationwide tryout that involved more than 25,000
pupils (Gates and MacGinitie, 1965d, p. 2)." The CAT difficulty scores
for all levels were from a nationwide pre-revision investigation.²



The California Test Bureau (1957, p. 12) controlled for geographic 
region and community size in selecting standardization samples. With 
these controls the performance of the samples of pupils drawn was ex- 
pected to be an "accurate estimate of the performance of the total pupil 
enrollment in elementary and secondary schools in the United States." 

In norming the Stanford Achievement Test:

The distribution according to region and size of system was
established in such a way that the norm sample would
duplicate these characteristics for pupils in average daily
attendance in public and private schools throughout the
country... Public schools (integrated, segregated white and
segregated Negro), private non-sectarian and private
sectarian schools were included in the sample. (Kelley, et al,
1966, p. 9)

The authors of the Gates-MacGinitie Reading Test also reported 
careful selection of the standardization population "on the basis of 
size, geographic location, average educational level, and average 
family income (Gates and MacGinitie, 1965d, p. 2)."

2 

The California Test Bureau was planning a revision of the CAT in 
1968. In order to establish which items should be retained from the 
old form of the tests, a review of item difficulties was undertaken. 

The sample of item difficulties was taken from tests sent to the 
California Test Bureau for scoring by school systems around the country 
that were using the CAT (telephone conversation with Dr. W. E. Kline, 
Managing Editor, California Test Bureau, 1971).

Table 4 presents the nature of the sample and the grades for which
difficulty scores were available, the size of the sample, and the year
of testing.¹

Although samples used to provide the difficulty scores were
relatively large (see Table 4) and were also usually stratified
according to geographic region and community size (Gardner, et al,
1965, p. 16; MacGinitie correspondence, 1970; Kline conversation, 1971),
there were no claims by authors that these samples represented national
performance as well as the larger, more carefully selected
standardization samples.

Treatment of Data 

Means, standard deviations, minimums, maximums and ranges of each 
readability score (see pp. 45-48) were computed for all selections,
questions and choices of the 10 tests at the 3 levels and in the 3 
batteries analyzed. Means, standard deviations, minimums, maximums and 
ranges of difficulty scores were also computed for the 10 tests. These 
computations summarized the distribution of readability and difficulty 
scores in individual tests, at the three grade levels and in the three 
test batteries. 
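
The summary statistics named above can be sketched as follows. The text does not say whether the standard deviations used n or n - 1 in the denominator; the sample form (n - 1) is assumed here, and the scores in the example are made up for illustration.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean, standard deviation, minimum, maximum and range of a score list."""
    low, high = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "sd": stdev(scores),  # sample standard deviation (n - 1), an assumption
        "min": low,
        "max": high,
        "range": high - low,
    }


if __name__ == "__main__":
    # Hypothetical word counts for eight reading selections.
    print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
```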

In order to explore the correlations that existed among selection, 
question and choice readability scores and between readability and 

¹Test publishers did not usually have item difficulty data
available for each grade for which the test was intended. In some tests
all grades were represented at one level (e.g. CAT Elementary - grades
4, 5 and 6), but not at another (e.g. CAT Advanced - grades 9, 10 and
12 only). The difficulty scores for most tests were given by grade.
GMRT, Survey F difficulty scores were for a combination of grades 10,
11 and 12.

difficulty scores, a common unit of measure was established. The
number chosen as the common unit corresponded to both difficulty scores 
and questions. For example, in establishing the common unit with the 
selections, all the data on a given selection (see pp. 45-47) were 
duplicated for each question referring to that selection. Table 5 
presents the mean, minimum and maximum number of questions that 
accompanied a given reading selection in each comprehension subtest 
analyzed. The number of questions accompanying a given reading selec- 
tion differed among grade levels and among tests. For example, 
reading selections on the CAT Lower Primary were accompanied by a mean 
number of 2.5 questions; while those on the CAT Advanced had an 
average of 9 questions each. The GMRT Level A, on the other hand, 
had only one question per selection; thus no weighting was needed. 

In order to establish a common unit with choices, the converse 
procedure was undertaken. A given score (e.g. number of words in the 
choice, see pp. 46 and 48) was added for the four or five choices that
accompanied a question. The result of this duplicating and merging 
of data was a common unit for selection, question, choice and 
difficulty scores with which Pearson product-moment correlations 
were computed. 
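
The construction of the common unit can be sketched as one row per question: selection data are duplicated for every question that refers to the selection, and choice data are summed over the choices of each question. The records, field names and values below are hypothetical; only the duplicating, summing and correlating steps follow the description above.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: two selections, three questions with their choices.
SELECTIONS = {"S1": {"n_words": 20}, "S2": {"n_words": 60}}
QUESTIONS = [
    {"id": "Q1", "selection": "S1", "pct_correct": 90, "choice_words": [1, 1, 2, 1]},
    {"id": "Q2", "selection": "S1", "pct_correct": 80, "choice_words": [2, 2, 3, 2]},
    {"id": "Q3", "selection": "S2", "pct_correct": 50, "choice_words": [4, 3, 5, 4, 4]},
]


def to_common_unit(selections, questions):
    """Build one row per question, the common unit described above."""
    rows = []
    for q in questions:
        rows.append({
            "question": q["id"],
            # selection score duplicated for each question on that selection
            "selection_words": selections[q["selection"]]["n_words"],
            # choice scores added over the four or five choices
            "choice_words_total": sum(q["choice_words"]),
            "pct_correct": q["pct_correct"],
        })
    return rows


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    rows = to_common_unit(SELECTIONS, QUESTIONS)
    lengths = [r["selection_words"] for r in rows]
    difficulty = [r["pct_correct"] for r in rows]
    print(rows)
    print(pearson_r(lengths, difficulty))
```

Because every variable now has exactly one value per question, any pair of columns can be correlated directly.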

The common unit was also used to compute selected weighted means
(Tables 13-18, and 20 in Appendix A). The weighted means 

¹Table 3, p. 40, presented the number of selections, questions and
choices analyzed. Usually there were a number of questions that
referred to one selection, and four or five choices that referred to
one question.

approximated the relative importance of selections and choices in
the tests. For example, if a pupil did not understand a given
selection, theoretically he was likely to get the questions with it
wrong. Similarly, a pupil did not need to understand each choice in
order to get a question right. Generally a pupil needed only to
identify the right answer, eliminate the wrong answer, or give an
educated guess about the choices in that "set." In any case, only
one question was affected.
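
A small sketch may make the effect of weighting concrete. It assumes, as the passage suggests, that a selection's score is weighted by the number of questions attached to it; the lengths and question counts are hypothetical.

```python
def unweighted_mean(values):
    """Each selection counts once, however many questions refer to it."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Each selection counts once per question that refers to it."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical example: a 20-word selection with 1 question and a
# 60-word selection with 3 questions.
SELECTION_WORDS = [20, 60]
QUESTIONS_PER_SELECTION = [1, 3]

if __name__ == "__main__":
    print(unweighted_mean(SELECTION_WORDS))
    print(weighted_mean(SELECTION_WORDS, QUESTIONS_PER_SELECTION))
```

The weighted mean gives the same result as averaging the duplicated selection rows of the common unit, which is why the longer, more heavily questioned selection pulls the mean upward.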

Results and Discussion 

The readability data are presented and discussed here in the form 
of general conclusions about the four objectives of this study. 

Selected data tables are included in the text to follow; however, for 
the reader who is interested in more specific results, tables of 
unweighted and weighted means, standard deviations, minimums, maximums
and ranges of readability and difficulty scores by test are presented 
in Appendix A, Tables 2-22. Tables of Pearson product-moment correla- 
tions among readability and difficulty scores by test are also 
presented in Appendix A, Tables 23-32. 

The first objective of this study was to characterize the nature of
reading comprehension as tested at three grade levels (grades 1-2,
4-6 and 9-14).

In order to determine the readability characteristics common to
the CAT, GMRT, and SAT, the data of the three test batteries were
combined for the lowest level tests (tests intended for grades 1-2),
for the intermediate level tests (tests intended for grades 4-6) and
for the advanced level tests (tests intended for grades 9-14).¹

Table 6 presents unweighted means and standard deviations for the
number of words in the reading selections, questions and choices, as
well as the number of non-Spache and non-Dale-Chall words by test
level.² The minimum, maximum and range for scores on Table 6 are
presented on Table 22 in Appendix A.

Eight generalizations applied to the readability scores: 

1. The higher the grade level of the test, the longer and harder³
its reading selections, questions and choices. For example, as seen
in Table 6, reading selections on the lowest level tests had a mean
length of 18.51 words, the mean length of selections in the
intermediate level tests was 64.71 words and the mean length for
selections in the advanced level tests was 130.33 words. A similar
trend of increases was observed for the other readability scores.



Results of three different standardized tests have been used by 
the U.S. Office of Education in assessing performance contractors 
(Klein, 1971, p. 2). The assumption seemed to be that three tests give 
a more valid appraisal of relevant student performance than one test. 

The tests used in this study were combined only when means and 
standard deviations of the three test batteries showed similar trends. 
Scores showing contrary trends would tend to cancel each other out and 
distort interpretation. 

2 

Since data for each test item on the CAT, GMRT, and SAT were
combined, tests with more selections, questions and choices were
"weighted" more than shorter tests. Usually, this resulted in the SAT
being "weighted" more than either the CAT or the GMRT. The GMRT was
"weighted" more than the CAT, e.g. in the lowest level tests, the CAT
had 15 test items, the GMRT had 34 test items and the SAT had 38.
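
The implicit weighting described in this footnote can be shown with the item counts it reports. The sketch computes each battery's share of the pooled lowest-level items; the function name is illustrative.

```python
# Item counts for the lowest level tests, as reported in the footnote.
ITEM_COUNTS = {"CAT": 15, "GMRT": 34, "SAT": 38}


def pooled_shares(item_counts):
    """Fraction of the pooled items contributed by each battery."""
    total = sum(item_counts.values())
    return {name: count / total for name, count in item_counts.items()}


if __name__ == "__main__":
    for name, share in pooled_shares(ITEM_COUNTS).items():
        print(f"{name}: {share:.1%}")
```

When items are pooled without adjustment, each battery influences the combined mean in proportion to these shares.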

³"Hard" refers to the number of non-Spache and non-Dale-Chall
words which also determine the Spache and Dale-Chall ratios and grade
scores.

It should be noted here that although length of reading matter
appeared to discriminate among test levels consistently, it has not as
yet been considered a factor in readability formulae.

Consequently, tests at higher levels seemed to require larger
and broader vocabularies. They also seemed to expect pupils to
assimilate and process a greater amount of information. However, on
the basis of the readability analysis it is not possible to
characterize the vocabulary or the information pupils were expected to
understand.

2. The differences in lengths and number of hard words of reading
selections, questions and choices from the lower level to the
intermediate level tests were consistently greater than the
differences from the intermediate level to the advanced level tests.
For example, the mean length of questions in tests intended for
grades 1-2 was 6.23 words. The mean length of questions in tests
intended for grades 4-6 was 18.13 words, and the mean length of
questions in tests intended for grades 9-14 was 18.85 words.

Thus, the average question in the intermediate level tests was
almost 3 times as long as the average question in the lowest level
tests. But the average question in the advanced level tests was almost
the same length as the average question in the intermediate level
tests. Although magnitudes differed, similar relationships appeared
among selection and choice lengths, as well as among hard word scores
for the selections, questions and choices.

The size of the increase from test level to test level seemed to 
depend on the readability score and the part of the item, e.g. 
average question length, as noted above, increased about 3 times from 
the lowest level to the intermediate level tests while hardly at all 
from the intermediate level to the advanced level tests. Average 
selection length similarly increased about 3 times from the lowest 
level to the intermediate level tests. But it also increased about 
2 times from the intermediate level to the advanced level tests. 
Furthermore, the average number of non-Spache words in the selection 
increased by 8.5 times from the lowest to the intermediate level 
tests and 2.5 times from the intermediate to the advanced level 
tests. 
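
The comparison of question lengths in finding 2 can be restated as 
simple ratios. The following Python sketch is illustrative only and was 
not part of the original analysis; the names used are assumptions, and 
the figures are the mean question lengths reported above. 

```python
# Mean question lengths (in words) reported in the text for each test level.
mean_question_length = {
    "grades 1-2": 6.23,
    "grades 4-6": 18.13,
    "grades 9-14": 18.85,
}


def growth_factor(lower: float, higher: float) -> float:
    """Return how many times larger `higher` is than `lower`."""
    return higher / lower


low_to_mid = growth_factor(
    mean_question_length["grades 1-2"], mean_question_length["grades 4-6"]
)
mid_to_high = growth_factor(
    mean_question_length["grades 4-6"], mean_question_length["grades 9-14"]
)

print(f"lowest -> intermediate: {low_to_mid:.2f} times as long")
print(f"intermediate -> advanced: {mid_to_high:.2f} times as long")
```

The first ratio is a little under 3 and the second is only slightly 
above 1, which is the pattern described in the text. 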

3. Usually, the higher grade level tests contained selections, 
questions, and choices with more diverse lengths and more diverse 
numbers of hard words. For example, the larger the mean of the 
number of non-Spache words, the larger the standard deviation of 
the number of non-Spache words. The mean for non-Spache words in 
the selection of tests intended for grades 1-2 was 1.19 with a 
standard deviation of 1.36. In tests intended for grades 4-6, the 
mean number of non-Spache words in the selection was 16.19 and the 
standard deviation was 12.50. Tests intended for grades 9-14 had a 
mean number of 41.08 non-Spache words in their selections and a 
standard deviation of 40.29. Similar increases in standard deviations 
existed for questions and for the other selection, question and choice 
readability scores on Table 6.^ 

^The one exception that existed was that the number of words in 
the choices of the lowest level tests had a slightly higher standard 
deviation than the intermediate level tests. This may have been due 
in part to the fact that on lowest level tests some choices were 
pictures without words and other choices were pictures with many words. 



Thus, selections, questions and choices of lower level tests 
had more uniform lengths and numbers of hard words than higher level 
tests. Although this generalization applied to the number of hard 
words, it was not always consistent with the ratios of hard words to 
the number of words in the selections, questions or choices. Table 7 
presents, for the reading selections, the ratio of the number of words 
to the number of sentences (average sentence length), the ratio of the 
number of non-Spache words to the number of words (Spache ratio), and 
the ratio of the number of non-Dale-Chall words to the number of words 
(Dale-Chall ratio). Weighted means for the ratios are presented in 
Table 17, Appendix A. Corresponding ratios for questions and choices 
are on Tables 19 and 20 in Appendix A. 

Table 7 also presents grade scores for the two readability formulae. 
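
The three ratios just described can be sketched in a few lines of 
Python. This is a minimal illustration, not the counting procedure used 
in the study: the word list is a toy stand-in for the Spache or Dale 
lists, the tokenizing rules are assumptions, and the hard-word ratio is 
assumed to be expressed per 100 words, as the magnitudes of the 
reported ratios suggest. 

```python
import re


def readability_ratios(text: str, familiar_words: set[str]) -> dict[str, float]:
    """Compute average sentence length and a hard-word ratio for a passage.

    `familiar_words` is a hypothetical stand-in for a published word list;
    any word not on it is counted as "hard". The ratio is returned as hard
    words per 100 words (an assumption about how the ratios were scaled).
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    hard = [w for w in words if w not in familiar_words]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "hard_words": len(hard),
        "avg_sentence_length": len(words) / len(sentences),
        "hard_word_ratio": 100 * len(hard) / len(words),
    }


# Toy word list and passage, invented for illustration only.
toy_familiar = {"the", "dog", "ran", "to", "a", "it", "was", "big"}
sample = "The dog ran to the river. It was a big river."
scores = readability_ratios(sample, toy_familiar)
print(scores)
```

In this toy passage "river" is the only word missing from the list, so 
it is counted as hard each time it occurs. The published formulae have 
their own rules for repeated words, names and inflected forms, which 
this sketch ignores. 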

4. With some exceptions the selections, questions and choices 
in tests intended for grades 9-14 had more uniform Spache and 
Dale-Chall ratios than tests intended for grades 4-6. For example, the 
standard deviation for the Spache ratio of selections in the 
intermediate level tests was 8.30, while the standard deviation for the 
Spache ratio of selections in the advanced level tests was 6.18. 

Thus, the number of words, the number of hard words, and the diversity 
of these counts were greater in the reading selections, questions and 
choices of tests intended for higher grade levels than in tests 
intended for lower grade levels. However, the diversity of hard word 
ratios (Spache and Dale-Chall ratios) in the selections, questions and 
choices did not consistently increase. 
5. The ratios of hard words in the reading selections and 
questions were more similar to each other than to the ratio of hard 
words in the choices. For example, the advanced level tests had a 
mean Spache ratio of 33.70 for the selections and an average Spache 
ratio of 35.66 for the questions (Table 19, Appendix A). However, 
the average Spache ratio for the choices was much larger, 67.75 (Table 
20, Appendix A). The reason for these similarities and differences 
may be that while selections and questions were usually made up of 
sentences that included simple words like articles and conjunctions, 
choices were usually a few isolated words of more uniform difficulty. 

No one seems to know what the ratio of hard words in reading 
selections of tests, school readers, or textbooks should be. 

The Dale-Chall ratio indicated the frequency of hard words in a 
given reading selection, question or choice.^ Thus, according to the 
Dale-Chall ratio, reading selections had about 1 hard word in 200 in 
tests intended for grades 1-2, about 1 hard word in 10 in tests intended 
for grades 4-6, and about 1 hard word in 4 in tests intended for grades 
9-14 (Table 7, p. 61). Questions had a similar progression, though 
usually higher frequencies, e.g. about 1 hard word in 100 words in lower 
level tests, about 1 hard word in 6 words in intermediate level tests 
and about 1 hard word in 4 words on advanced level tests (Table 19, 
Appendix A). Choices had a similar progression, but even higher 
frequencies, e.g. about 1 hard word in 50 on lowest level tests, 1 
hard word in 4 on intermediate level tests, and 1 hard word in 2 
on advanced level tests.^ 

^Non-Spache words do not appear on the International Kindergarten 
Union List and the first 1000 words of Thorndike's Teachers Word Book 
(Spache, 1953). These lists probably include most words known by 1st 
graders. Non-Dale-Chall words do not appear on the Dale List of 3000 
Familiar Words. This list includes words known by 80% of a sample of 
4th graders. Since 4th graders understand more words than most 1st 
graders, non-Spache words include more "simple" words than non-Dale-Chall 
words. Also since 4th grade reading level is the established literacy 
criterion in this country, words not known by most 4th graders may be 
viewed as generally difficult. 
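
The "1 hard word in N" frequencies given for reading selections can be 
converted to hard words per 100 words. The short Python sketch below is 
illustrative only; the function name is an assumption, and the inputs 
are the approximate frequencies quoted in the text rather than the 
exact tabled ratios. 

```python
def one_in_n_to_percent(n: float) -> float:
    """Convert a frequency of "1 hard word in n words" to hard words per 100."""
    return 100 / n


# Approximate Dale-Chall frequencies for reading selections, as quoted above.
selection_frequencies = {
    "grades 1-2": 200,
    "grades 4-6": 10,
    "grades 9-14": 4,
}

for level, n in selection_frequencies.items():
    print(f"{level}: 1 in {n} = {one_in_n_to_percent(n):.1f} per 100 words")
```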

Before the discussion of the underlying meaning of readability 
scores, three more findings are presented. 

6. Grade scores of the two readability formulae were not identical. 
As noted above the Spache formula was intended by its author only for 
grades 1 to 3. The Dale-Chall formula was intended by its authors only 
for grades 4 and above. Consequently, the Spache grade score for the 
intermediate and advanced test levels and the Dale-Chall grade score 
for the lowest test level do not refer to the grade level of children 
for whom these tests are appropriate. These grade scores were presented 
only to demonstrate the relationship existing among the test levels 
and between readability formulae. The remaining scores, however, may 
give an indication of appropriate grade level. 

The predictions made by readability formulae are generally 
accurate and reliable within the range of about one year (Chall, 1958b; 
Dyer, 1971). Thus, the Spache appraisal of the lowest level test (2.29) 
seemed to correspond to the grades for which the authors intended the 
test. The Dale-Chall appraisal of the intermediate level test (8.02) 
seemed higher than suitable for the grades intended. And, the 
Dale-Chall appraisal of the advanced level test (12.91) appeared 
to correspond to the grades for which the tests were intended. One 
reason for differences between readability and test-authors' appraisals 
may be the criterion used in establishing "grade appropriateness." 
Readability formulae used items passed by 50%-75% of the pupils in 
a given grade. Test-authors sometimes used items passed by only 26% 
of the pupils in a given grade. Furthermore, although the mean grade 
score of items on a given test seemed to correspond to the grades for 
which the test was intended, little can be noted about the "grade 
appropriateness" of individual selections. 

^Test-authors generally intend that reading comprehension tests 
resemble reading matter in school books. In fact, one test-author 
suggested that a school intending to use his tests, "examine its own 
curriculum and the test content ... to ascertain whether or not the 
latter satisfactorily covers the former" (Kelley, et al., 1966, p. 23). 

7. Weighting readability scores by the number of questions 
had little effect on the relationships among levels. Mostly the 
direction of relationships remained the same although the sizes of 
relationships were somewhat increased or decreased. For example, the 
unweighted means for the number of sentences in the reading selections 
at the three test levels analyzed were 3.12, 5.88 and 9.51 (Table 3, 
Appendix A). The weighted means for the same three levels on the same 
score were 2.36, 6.11, and 10.84 (Table 14, Appendix A). 

The greatest proportional effect of weighting scores seemed to 
be in the reduction of means on the lowest level tests. This may have 
been due to the inclusion of "0" scores for the test items at that 
level which had no selections. 
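
The effect described in finding 7 can be illustrated with a small 
Python sketch. The data are invented and the weighting rule (each 
selection's score counted once per question attached to it) is an 
assumption about how the weighted means were formed, not a description 
of the study's actual computation. 

```python
def unweighted_mean(scores: list[float]) -> float:
    """Mean in which every selection counts once."""
    return sum(scores) / len(scores)


def weighted_mean(scores: list[float], question_counts: list[int]) -> float:
    """Mean in which each score is weighted by its number of questions."""
    total_questions = sum(question_counts)
    weighted_sum = sum(s * q for s, q in zip(scores, question_counts))
    return weighted_sum / total_questions


# Hypothetical lowest-level test: number of sentences per selection.
# The first entry is 0 because those items had no reading selection.
sentence_counts = [0, 2, 4, 6]
questions_per_selection = [3, 1, 1, 1]

plain = unweighted_mean(sentence_counts)
weighted = weighted_mean(sentence_counts, questions_per_selection)
print(f"unweighted mean: {plain:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

Because the "0" score carries the most questions in this invented 
example, weighting pulls the mean down, which is the direction of the 
change reported for the lowest level tests. 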

8. One interesting side note is the relationship between adjacent 
level tests. The only adjacent level tests in this study were the SAT 
Intermediate I, intended for grade 4, and the SAT Intermediate II, 
intended for grades 5 and 6.^ 

^All other tests skipped in-between levels. For example, between 
the CAT Lower Primary and Elementary tests analyzed in this study, is a 
CAT Upper Primary not analyzed in this study. 
The increases in the means of readability scores were generally 
as consistent in the adjacent tests as in tests where levels were 
skipped. However, the SAT Intermediate I more frequently had higher 
minimum and maximum scores than the SAT Intermediate II. For example, 
the mean number of words in the reading selections of the SAT 
Intermediate I was 64.79, and of the SAT Intermediate II was 67.04. 
Although the average reading selection on the SAT Intermediate II was 
a bit longer, both the shortest and longest selections on this test 
were shorter than the corresponding selections on the Intermediate I, 
e.g. the longest reading selection on the Intermediate I had 161 words, 
and on the Intermediate II had only 127 words. The shortest reading 
selection on the Intermediate I test had 13 words and the shortest 
selection on the Intermediate II had 12 words. Such reversals 
occurred infrequently when in-between test levels were skipped, e.g. 
the highest Spache ratio for a reading selection in the SAT 
intermediate level tests was 50.00, while the highest Spache ratio for a 
reading selection on the SAT advanced level test was only 39.39 (Table 
17, Appendix A). 

On the whole, the reading comprehension subtests analyzed 
consistently increased in the number of words as well as in the number 
and ratio of hard words in the average reading selections, questions and 
choices at each higher test level. Generally, differences between 
levels were not uniform. Greater differences in readability scores 
existed from the lower to the intermediate than from the intermediate 
to the advanced level tests. 
The significance of these findings rests not only in the scores 
themselves. Readability scores reflect more fundamental factors 
about the language used to write reading selections, questions and 
choices. Horn stated: 

...difficulty of vocabulary is tied up with the 
remoteness of the concepts from the reader's 
experience; and a large number of different words 
and long involved sentences are related to the 
complexity of the ideas presented. (Horn, 1937, 
p. 170) 

Therefore, pupils were tested on more "remote concepts" and on 
more "complex ideas" in the selections, questions and choices of 
higher level tests. Also, the present analysis suggested that 
"remoteness" and "complexity" increased more from the lowest level 
tests to the intermediate level tests, than from the intermediate to 
the advanced level tests. Further investigation was undertaken to 
determine whether or not these differences were consistent in dif- 
ferent reading comprehension subtests. 

The second objective of this study was to characterize the 
nature of reading comprehension as tested by different test batteries. 

The readability scores of the reading comprehension subtests of 
three widely used test batteries (California Achievement Test, 1963; 
Gates-MacGinitie Reading Test, 1965, 1969; and Stanford Achievement 
Test, 1964, 1965) were contrasted. Table 8 presents unweighted means 
and standard deviations for the number of words in the selections, 
questions and choices. Minimums, maximums and ranges for these scores 
can be found in Tables 2, 6, and 9 in Appendix A. 
^Many of the choices were pictures without any words. Some choices were pictures with words. 



Four generalizations were made on the basis of the analysis by 
test batteries. 

1. Findings about test levels within each battery were similar to the 
findings about test levels when batteries were combined: 

a. the higher the grade level within each battery, the 
longer and harder its reading selections, questions 
and choices. 

b. the differences in lengths and number of hard words in 
the reading selections, questions and choices from the 
lowest level to the intermediate level tests were pro- 
portionately greater than the differences from the 
intermediate to the advanced level tests. 

c. the higher grade level tests usually contained selections, 
questions and choices with more diverse lengths and more 
diverse numbers of hard words. More specifically, the 
higher the means of these scores, the higher their standard 
deviations. 



The following considerations should be noted in applying these 
generalizations to choices. First, the GMRT, Level A had picture 
choices with and without words. Second, the range for the number of 
words in the choices of the 3 test batteries varied. The CAT choices 
ranged from 1 to 19 words (Table 9, Appendix A). The SAT had a similar 
range of 1 to 14 words. However, the choices on the GMRT ranged only 
from 1 to 3 words, and in Surveys D and F, the choices were almost 
uniformly 1 word. 

Another exception was the number of words in the questions. Both 
the CAT and GMRT had fewer words in the questions of the advanced level 
test than in the questions of the intermediate level tests (Table 6, 
Appendix A). 

The data from which these generalizations were made are in Tables 2-22, 
Appendix A. 




d. more often than not, selections, questions and choices 
in tests intended for grades 9-14 had more uniform 
Spache and Dale-Chall ratios than tests intended for 
grades 4-6. 

e. the ratios of hard words in the reading selections and 
questions were more similar to each other, than to the 
ratio of hard words in the choices. 

For example, selection length increased from lower level to higher 
level tests in each battery. On the CAT the average length for 
reading selections was 23 words at the lowest level test, 198 words 
at the intermediate level test and 418 words at the advanced level 
test. Similarly, on the GMRT, reading selections had an average 
length of 20.38 words at the lowest test level, 42.81 words at the 
intermediate level and 57.33 words at the advanced level. In keeping 
with this pattern, the SAT reading selections had an average of 17.06 
words at the lowest test level, 65.91 words at the intermediate level 
and 137.38 words at the advanced level. Generally, question and 
choice lengths increased from level to level in each test battery as 
well. Other readability scores usually also increased with higher 
test level. 

However, as seen from these data, the average length for reading 
selections at a given level was not the same in the three test batteries 
analyzed; nor, generally, were the increases from level to level the 
same. 

2. The CAT, GMRT and SAT usually differed in the average number of 
words and number of hard words they contained in reading selections, 
questions and choices at a given grade level. For example, on the 
advanced level tests reading selections in the CAT were longest with 
an average length of 418 words. Reading selections in the GMRT were 
shortest with an average length of 57.33 words. Reading selections in 
the SAT had an in-between average length of 137.38 words. 

Among the 3 test batteries analyzed, the GMRT most often had the 
longest questions with the most hard words. For example, on the 
advanced level tests, the questions in the CAT were shortest with an 
average of 13.64 words.^ The questions in the GMRT were longest with 
an average of 25.02 words. And, the questions in the SAT were 
in-between with an average of 17.52 words. 

In part, this may have been due to the fact that questions on 
the different tests often took different forms. The CAT always had 
separate questions in either sentence completion or direct question 
form. The GMRT, Primary A had either regular questions or questions 
implicit in the selection, e.g. a direction telling a pupil to mark 
one of the choices, or a description of one of the picture choices 
which the pupil was expected to mark. In Surveys D and F, the GMRT 
always had cloze-like blanks in the reading selections. The SAT had 
questions in all these forms. 

^The CAT Lower Primary and Elementary were exceptions since 
instructions on how to respond immediately preceded the questions. 
When such instructions were not read out loud by the teacher and were 
necessary for getting the question right, they were added to the 
readability counts of the question. The questions were thereby 
artificially lengthened at these levels. 
3. The increases in readability scores from test level to next 
higher test level differed for the CAT, GMRT and SAT. For example, 
the reading selections of the intermediate level CAT were over 8 
times as long as those on the lowest level CAT. The intermediate 
level GMRT had reading selections about twice as long as the lowest 
level GMRT. And intermediate level SAT reading selections were about 
4 times as long as lowest level SAT reading selections. Similar 
differences in increases existed among the other readability counts 
of selections, questions and choices (Tables 3-20, Appendix A). 

Despite these differences in readability counts among tests, 
some similarities were observed in ratios. 

4. The average sentence length in the selections as well as ratios 
of hard words to total words in the selections and questions were 
relatively uniform among tests. For example, on the intermediate 
level questions, the CAT had a Spache ratio of 24.76, the GMRT had a 
Spache ratio of 26.89 and the SAT had an average Spache ratio of 
25.94 (Table 19, Appendix A). 

This was not the case with choices. For instance, on the 
advanced level test the CAT had a Spache ratio of 50.97, the GMRT had 
a Spache ratio of 84.99 and the SAT had a Spache ratio of 67.30. 
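
The level-to-level increases cited in generalization 3 follow directly 
from the mean selection lengths reported earlier for each battery. The 
Python sketch below recomputes them; it is an illustration only, and 
the dictionary layout and names are assumptions. 

```python
# Mean reading-selection lengths (in words) reported in the text.
mean_selection_length = {
    "CAT": {"lowest": 23.0, "intermediate": 198.0, "advanced": 418.0},
    "GMRT": {"lowest": 20.38, "intermediate": 42.81, "advanced": 57.33},
    "SAT": {"lowest": 17.06, "intermediate": 65.91, "advanced": 137.38},
}

# For each battery, how many times longer selections become at each step.
growth = {
    battery: {
        "lowest_to_intermediate": means["intermediate"] / means["lowest"],
        "intermediate_to_advanced": means["advanced"] / means["intermediate"],
    }
    for battery, means in mean_selection_length.items()
}

for battery, steps in growth.items():
    print(
        f"{battery}: "
        f"{steps['lowest_to_intermediate']:.1f}x, then "
        f"{steps['intermediate_to_advanced']:.1f}x"
    )
```

The first step is largest for the CAT and smallest for the GMRT, in 
line with the "over 8 times", "about twice" and "about 4 times" 
figures quoted above. 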

Interesting patterns emerged from the relationships found among 
batteries. For example, Figure 1 demonstrates the ranks of the CAT, 
GMRT, and SAT on the average number of words in the reading selections, 
questions and choices.^ 

^All levels analyzed within one test battery, e.g. CAT, were 
combined. Although ranking the scores tends to magnify the differences, 
it clarifies the general patterns. 
Figure 1. Ranks of three reading comprehension tests on the number of words in their items. 
Characteristically, the CAT had longer selections and choices 
but shorter questions than the GMRT or the SAT. Characteristically, 
the GMRT had shorter reading selections and choices, but longer 
questions than the CAT or the SAT. The SAT scores seemed consistently 
in-between those of the CAT and the GMRT. More often than not, the 
rank for the levels combined accurately reflected the rank at any 
given level. For example, at the advanced level, the CAT had longer 
reading selections and choices, but shorter questions than the GMRT 
or SAT. The GMRT at that level had the opposite, i.e. shorter 
selections and choices, but longer questions than the CAT or SAT. 
The SAT again remained between the CAT and GMRT. 

Although similar patterns emerged for other readability counts, 
usually opposite patterns occurred for readability ratios, i.e. when 
the rank of readability counts in a test went up, the rank of the 
ratios in that test went down and vice versa.^ For example, Figure 2 
presents the rank of the advanced level GMRT, in comparison to the 
advanced level CAT and SAT, on the number of words, the number of 
non-Dale-Chall words and the Dale-Chall ratios in the reading 
selections, questions and choices. 

Of the subtests analyzed at the advanced level, the GMRT had the 
fewest words, the fewest hard words (non-Dale-Chall words) but the 
highest proportion of hard words (Dale-Chall ratio) in the selections 
and choices. The reverse was true for questions, i.e. questions had 
the most words, the most hard words but a smaller proportion of 
hard words. Similar patterns emerged for other test levels. 

^Inconsistencies in the patterns occurred among the ranks of the 
tests intended for grades 1-2 and among question readability 
scores. 

As has been shown, the readability scores of the CAT, GMRT, and 
SAT differed. The differences tended to fall into characteristic 
patterns. Tests with longer selections and choices had shorter 
questions. Tests with more words and more hard words in their 
selections, questions or choices had a lower proportion of hard words. 

The third objective of this study was to identify factors that 
may contribute to difficulty in tested comprehension. 

The empirical criteria of the difficulty of test items were the 
difficulty scores provided by test publishers. These difficulty 
scores represented the percentage of pupils who answered a given item 
correctly (see p. 49). Means, standard deviations, minimums, maximums 
and ranges of difficulty scores by grade and test are presented in 
Table 21, Appendix A. 

The distribution of difficulty scores led to the following 4 
observations: 

1. Generally the minimum and maximum difficulty scores increased for 
each higher grade included in a given test level. For example, the 
intermediate level CAT was intended for grades 4, 5, and 6. In the 
4th grade the minimum percentage of pupils passing a given test item 
was 12.3, while the maximum percentage was 81.5. In the 5th grade 
the minimum difficulty score rose to 18.5 and the maximum rose to 92.4. 

2. Although increases occurred generally, both the size of the 
minimum and maximum difficulty scores and the size of the increases were 
different for the CAT, GMRT and SAT. For example, in the GMRT the 
4th grade difficulty score minimum was 2.2 and the maximum was 
92.0. In the 6th grade GMRT the difficulty score minimum was 22.0 
while the maximum was 95.7. Thus, the difference in minimum 
difficulty scores from the 4th to the 6th grade CAT was 11.5 (see CAT 
scores in "1" above), while the difference in minimum scores from 
the 4th to the 6th grade GMRT was 19.8. 

3. The CAT, GMRT and SAT had different ranges of difficulty scores. 
The CAT generally had the narrowest ranges of difficulty scores: 
the smallest range was 46.1 and the widest range in that battery was 
69.2. The GMRT generally had the widest ranges of difficulty scores. 
The smallest GMRT range was 73.7 and the widest range was 89.8. The 
SAT had in-between ranges of difficulty scores; 60.0 was the smallest 
in that battery and 77.0 was the widest. 

4. The CAT, GMRT and SAT also had different means for difficulty 
scores. Mean difficulty scores on the CAT went from 26.0 to 64.2, 
on the GMRT from 48.5 to 71.7 and on the SAT from 43.3 to 61.3. 
Generally then, the CAT had the lowest means, the GMRT had the 
highest means and the SAT had means that fell in-between. 
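
The difficulty score and its range, as used in the four observations 
above, can be written out as a short Python sketch. The function names 
and the pupil counts are hypothetical; the minimum and maximum values 
are the 4th grade figures quoted in observations 1 and 2. 

```python
def difficulty_score(num_correct: int, num_pupils: int) -> float:
    """Percentage of pupils who answered an item correctly."""
    return 100 * num_correct / num_pupils


def score_range(minimum: float, maximum: float) -> float:
    """Range of difficulty scores, rounded to one decimal place."""
    return round(maximum - minimum, 1)


# Hypothetical item: 3 of 4 pupils answered correctly.
example_item = difficulty_score(3, 4)

# 4th grade minimum and maximum difficulty scores quoted in the text.
cat_grade4_range = score_range(12.3, 81.5)
gmrt_grade4_range = score_range(2.2, 92.0)

print(example_item, cat_grade4_range, gmrt_grade4_range)
```

The two computed ranges equal the widest ranges reported for the CAT 
and the GMRT in observation 3, although the text does not state which 
grade those widest ranges came from. 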

The differences among difficulty scores reflected the differences 
in the criteria used by the constructors of the three test batteries. 
For example, while the designers of the CAT chose comprehension test 
items that were passed by an average of 26% to 64% of the pupils at 
a given grade, the designers of the GMRT seemed to prefer comprehension 
test items passed by a greater percentage of pupils, an average of 
48.5% to 71.7% of pupils at a given grade. The differences among 
difficulty scores may also reflect the particular sample of pupils 
who took the tests. Different groups of pupils are likely to 
produce different difficulty scores. 

Due to these differences, difficulty scores were studied only 
by individual tests (for correlations of readability and difficulty 
scores by test see Tables 23-32 in Appendix A). Four generalizations 
were made on the basis of the correlation analysis. 

1. The highest correlations existed consistently among the difficulty 
scores themselves. Hence, it seemed that factors which made an item 
difficult at one grade level tended to be very closely related to 
factors which made the same item difficult at another grade level. 
For example, in the CAT intermediate level test (Table 24, Appendix A), 
which is intended for grades 4, 5 and 6, the correlation coefficient 
for the difficulty scores of grade 4 and 5 was .98. The correlation 
coefficient for difficulty scores of grade 3 and 6 was .95. 
Correlations of difficulty and readability scores were lower. The 
highest correlation coefficient for a difficulty and a readability 
score was .91, the correlation of the items' difficulty scores in 
grade 4 and the number of non-Dale-Chall words in the selections. 

2. Although the difficulty scores for different grades were consistently 
and highly intercorrelated, the difficulty scores for one grade 
correlated quite differently with readability scores than the difficulty 
scores for the other grades. For example, on the lowest level CAT 
(Table 23, Appendix A) the correlation coefficients of difficulty and 
readability scores for the 1st grade were often more than twice as 
large as the correlation coefficients for the 2nd grade, e.g. 
1st grade difficulty scores and number of words in the selections 
had a correlation coefficient of -.67. Second grade difficulty scores 
and the number of words in the selections had a correlation coefficient 
of -.29. 

3. Difficulty scores correlated "most highly" with different 
readability scores in the 10 subtests analyzed. For example, on the 
lowest CAT (Table 23, Appendix A), difficulty scores correlated most 
highly with the following readability scores: Dale-Chall grade score 
for the selection (-.71), the number of sentences in the selection 
(-.71) and the number of words in the question (-.70). At the 
intermediate level, the CAT difficulty scores (Table 24, Appendix A) 
correlated most highly with different readability scores: number of 
non-Dale-Chall words in the selection (-.91, -.88, -.88), and the number 
of non-Spache words in the selection (-.90, -.87, -.87).^1 Furthermore, 
the GMRT's difficulty scores at the intermediate level (Table 27, 
Appendix A) correlated most highly with another combination of 
readability scores: Dale-Chall grade score for the selection (-.60 and 
-.62), Dale-Chall ratio in the selections (-.55 and -.56), questions 
(-.51 to -.55), and choices (-.53 and -.54).^2 

^1 The number of correlation coefficients corresponds to the number of 
grades for which difficulty scores were available, i.e. grades 4, 5 and 6. 

^2 The correlation coefficients correspond to correlations of the 
readability scores with difficulty scores for grades 4 and 6. 
Correlation coefficients demonstrate the relationship between 
two groups of scores. It has been shown that difficulty scores do 
not consistently correlate with the same readability scores in the 
different subtests analyzed. In order to find the effect these 
readability scores have in determining difficulty scores in a given 
test, either many partial correlations would have to be computed or 
a multiple regression analysis would have to be conducted. 

4. Generally the difficulty scores had higher correlations with 
the selections' readability scores than with the readability scores 
of either questions or choices. For example, correlation coefficients 
of the CAT advanced level test (Table 25, Appendix A), ranged from 
.04 to -.55 for difficulty scores and selection readability scores; 
from .02 to -.21 for difficulty scores and question readability scores; 
and from .01 to .20 for difficulty scores and choice readability scores. 
Similar relationships appeared in other tests as well. 

In order to find the effect that selection, question or choice 
readability scores have in determining difficulty scores, a multiple 
regression analysis would have to be conducted. However, generally, 
the correlation coefficients between readability and difficulty scores 
were not very high. For example, only the CAT elementary and inter- 
mediate level subtests had correlation coefficients for readability 
and difficulty scores that were over .70. Usually these coefficients 
were much lower (see Tables 23-32, Appendix A). Therefore, it appears 
that factors other than "readability" also influence item difficulty. 
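
For readers unfamiliar with the statistic, a correlation between item 
difficulty scores and a readability score can be computed as sketched 
below in Python. The sketch assumes a Pearson product-moment 
coefficient, which the text does not name, and the five items are 
invented for illustration; they are not data from the tests analyzed. 

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical items: hard words in the selection, and the percentage of
# pupils answering the item correctly (the difficulty score).
hard_words = [2, 5, 9, 14, 20]
percent_passing = [85.0, 78.0, 66.0, 52.0, 41.0]

r = pearson_r(hard_words, percent_passing)
print(f"r = {r:.2f}")
```

Because the difficulty score is a percentage passing, items with more 
hard words and fewer pupils passing produce a negative coefficient. A 
single coefficient of this kind describes only one pair of scores, 
which is why the text notes that partial correlations or a multiple 
regression would be needed to separate the contributions of several 
readability scores. 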



The fourth objective of this study was to characterize the nature 
of tested reading comprehension. 

On the basis of the readability analysis, four conclusions about 
tested reading comprehension seemed appropriate: 

1. Comprehension was usually tested by longer selections, questions, 
and choices at more advanced levels. Also at more advanced levels, 
the selections, questions and choices contained more hard words and 
a greater ratio of hard words to the number of words. Furthermore, 
the increases in readability scores from level to level were not 
uniform. Greater increases appeared between lower levels than 
between higher levels on almost all scores. 

2. Comprehension appeared to be tested somewhat differently by the 
CAT, GMRT and SAT. While one test battery had more words, more hard 
words and a relatively small hard words/number of words ratio in its 
selections and choices, another battery had fewer words, fewer hard 
words and a relatively high hard words/number of words ratio in its 
selections and choices. The subtests at the 3 levels analyzed 
within each battery, however, seemed to be relatively consistent. 

3. Empirical difficulty of comprehension test items seemed to be 
"most correlated" with a different set of readability scores in each 
subtest. Usually item difficulty was more related to readability 
scores of selections than readability scores of questions or choices. 

4. Empirical difficulty of tested reading comprehension did not seem 
very closely related to readability factors in general. While on some 
subtests, item difficulty and readability scores were highly correlated, 
on most subtests, the correlations were not very high. This finding 
suggested that other factors may heavily influence the difficulty of 
tested reading comprehension. 

The readability analysis presented thus far has outlined
"stylistic elements" of reading comprehension tests. The fact that
selections, questions and choices get longer and have more words
suggested that the ideas presented became more complex and unfamiliar.
However, the nature of that complexity and unfamiliarity has not been
revealed.¹ In the section to follow, the content of the subject
matter as well as the tasks to be performed on reading comprehension
subtests are studied.

¹Chall (1958b, p. 156) stated that according to the "judgment of
experts (teachers, librarians, and publishers) and readers...content,
stylistic elements, format, and organization contribute to difficulty."
Stylistic elements are represented by the readability analysis.
Some description of the format of tests and test items has also been
included in this chapter.

CHAPTER V



Task Measurement 
Introduction 

Tasks for comprehension are empirically established by reading
comprehension test items. Successfully performing the task means
answering the test item correctly. The knowledge and behavior
necessary to perform well on reading comprehension tests have not as
yet been clearly specified.

Wittgenstein suggested that we "think of words as instruments
characterized by their use" (in Chihara and Fodor, 1966, p. 388).
By use he meant, according to Pitcher (1964), saying or writing the
word, following directions involving the word, fetching or drawing
what the word represented and also discriminating the object the word
represented from other objects. According to Wittgenstein, accurately
using words in these ways demonstrates one's understanding.

Although Wittgenstein's concern was with the meanings of words,
the same strategies for demonstrating understanding may also apply
to larger language units such as sentences, paragraphs and stories.
In fact the strategies he suggested resembled the questions developed
by Thorndike that still are used in modified form on current
standardized reading comprehension tests. For example, on several
items in the lowest level standardized comprehension subtests
analyzed, pupils are required to mark the picture that corresponds
to a word in a direction, e.g. "Mark the cat" (GMRT, Level A, item 1).
The direction is accompanied by four picture choices: 1) a cat in front
of a fireplace 2) a chair 3) a chair in front of a fireplace 4) a dog.
Such items require the pupil to follow directions that involve the
object and to perform the behavior described by the words, as well as
to discriminate the object from other objects. The more advanced
comprehension tests analyzed required the pupil to discriminate the
correct answer from among a number of possible answers (multiple-choice
distractors).

Thorndike indicated that the comprehension task as he defined 
it was determined by aspects of the reading selection, the question 
and the responses. Responses on nearly all comprehension test items
analyzed are now restricted to discriminating among multiple-choice
answers. Therefore, in analyzing the tasks, this study investigated 
qualitative aspects of the reading selections, the questions and the 
multiple-choice answers. 

A rating scale for reading selections, a rating scale for 
questions and a rating scale for choices were devised to categorize the 
item tasks. Much effort was put into specifying sufficient criteria
for the ratings so that raters could easily agree.

The rating scales will be presented in the procedure section
to follow, but a brief description of them is presented here. The
reading selections were rated according to topic or content. Eight
topics were defined and included in the selection rating scale:
1) riddle 2) story 3) language 4) math 5) social studies 6) social
science 7) science 8) humanities.¹ A "0" category was also included
for those test items that had no reading selections.

The scale for questions rated the relationship between the way
information was presented in the reading selection and the way the same
information was presented in the correct answer. Nine categories were
defined and included in the question rating scale: 1) recognition
2) contextual paraphrase 3) grammatical paraphrase 4) semantic
paraphrase 5) definite concepts 6) probable concepts 7) language concepts
8) previous knowledge 9) word-picture matching. When appropriate, the
last category was combined with one of the preceding categories. For
example, if the picture in the choice was not clearly described in the
selection or question, but was probably the best choice on the basis of
indirect clues given, the item was rated as "picture-matching" and
"probable concept."

Lastly, the scale for choices rated the relationship of distractors
to the selection, to the question and to the correct answer. In
all, 12 ratings were defined on the scale: 1) other 2) grammatical
3) associative 4) associative-grammatical 5) categorical
6) categorical-grammatical 7) textual 8) textual-grammatical
9) textual-associative 10) textual-associative-grammatical
11) textual-categorical 12) textual-categorical-grammatical.
Generally, "grammatical" referred to whether or not a given distractor
was a grammatical answer to the question. "Associative" referred to a
slight semantic or conceptual relationship of a given distractor to
the correct answer choice, i.e. the distractor represented a feature
or function of the word represented by the correct choice.
"Categorical" referred to a closer semantic or conceptual relationship
of a given distractor to the correct answer choice than "associative,"
i.e. the "categorical" distractor was the same kind of object as the
correct answer choice. "Textual" simply meant that the distractor was
a word in the reading selection of the test item. And "other" meant
that the distractor was none of the above; essentially the distractor
was irrelevant to the item.

¹The categories for selection, question and choice rating scales
were identified during a preliminary study of a number of reading
comprehension subtests at different levels by Auerbach (1970). Generally
the categories included in the scales reflected the variety of items
on reading comprehension tests.

As with readability scores, the significance of these rating scales
is not immediately apparent. The scale categories tend to reflect more
fundamental factors about the test items. Each scale reflects, to some
extent, contextual, grammatical, semantic and conceptual aspects of the
items, e.g. the scale for questions rates whether the information
is given explicitly, implicitly or not at all in the context, whether
grammatical changes have occurred in the words presenting the
information, whether semantic changes have occurred in the words
presenting the information and whether the concepts involved are
general or academic.

An exploration of reading comprehension tasks follows. Comparisons 
were made among the three test levels and the three test batteries. 

Procedure 

Each reading selection, question and distractor in the CAT, GMRT
and SAT was coded independently according to the following rating scales:



RATING SCALE FOR SELECTIONS

Code and Definition

0 = No selectionᵃ

1 = Riddle selection - the selection is a description or clue
    given to help the pupil identify a common object, act, etc.,
    e.g. "I play with my new toy. It is a 1) ball 2) something
    3) little 4) play" (Stanford Achievement Test, Primary 1,
    Form X, Paragraph Meaning, 1).

2 = Story selection - the selection is about relatively common
    occurrences, events, people; not academically oriented.

3 = Language selection - the selection is primarily about language
    usage or literature.

4 = Math selection - the selection is about mathematics or
    requires mathematical concepts.

5 = Social studies selection - the selection is about history,
    geography, etc.

6 = Social science selection - the selection is about psychology,
    sociology, anthropology, etc.

7 = Science selection - the selection is about general science,
    chemistry, biology, etc.

8 = Humanities selection - the selection is about philosophy, art,
    theology, etc.

ᵃSome items have only questions and choices and no selections, e.g.
"Which is the big tree?" (Gates-MacGinitie Reading Test, Primary A,
Form 1, Comprehension 2). The question is followed by four choices,
one of which is a big tree.



RATING SCALE FOR QUESTIONS

Code and Definition

1 = Recognition

Choosing the right answer requires recognizing an identical
word that appears in the selection in the same general context.

ex: The hedgehog of the Old World is a small mammal
    similar to a porcupine. When it is in danger it
    rolls itself up into a ball so that it resembles
    a pincushion and is protected by its sharp quills.

    The Old World hedgehog is a

    porcupine    pincushion    plant    mammal

    (Gates-MacGinitie Reading Test,
    Primary C, Form 2, Comprehension 17A)



2 = Contextual paraphrase

Choosing the right answer requires recognizing an identical word
that appears in the selection in a different linguistic context.

ex: In the tropics, bacteria grow so rapidly that they
    quickly destroy rotting plant matter, called humus,
    in the soil. Tropical soils have little

    iron    humus    soil    growth

    (Gates-MacGinitie Reading Test,
    Primary E, Form 3, Comprehension 31)

3 = Grammatical paraphrase

Choosing the right answer requires recognizing a grammatical variant
(different number, voice, tense, etc.) of a word that appears in the
selection in a different linguistic context.

ex: We all inspire and expire when we breathe. Inspiration
    is the act of taking into ourselves something which is
    not a part of us. ______ is the act of giving back
    what we have thus obtained....

    1. Expire     3. Expiration
    2. Inspire    4. Inspiration

    (Stanford Achievement Test,
    Advanced, Form W, Paragraph Meaning 29)



Note: The correct answer is underlined.

4 = Semantic paraphrase

Choosing the right answer requires recognizing a semantic variant
(synonym, translation, paraphrase, etc.) of a word or phrase that
appears in the selection in a different linguistic context.

ex: If you look at your hands closely, you will see that
    the skin has little ridges. The pattern of the
    ridges on the tip of one of your fingers never
    changes while you live, and this design is different
    from that on any other finger in the world. This is
    why the police can use ______ as a means of
    identification.

    1. photographs    3. handshakes
    2. handwriting    4. fingerprints

    (Stanford Achievement Test,
    Intermediate 2, Form W,
    Paragraph Meaning 4)

5 = Definite concepts

Choosing the right answer requires identifying a "common" concept
that

a. is not stated in the selection
b. definitely applies to the instances or attributes
   mentioned in the selection
c. and is the only choice that meets the above
   conditions.

ex: The third-grade class went on a trip. They saw the
    fenced fields, the tall silo, and the powerful tractor.
    They watched the horses and cows and fed the chicks.
    They were even allowed to hold the baby rabbits.

    They saw many

    engines    pigs    trees    animals

    (Gates-MacGinitie Reading Test,
    Primary C, Form 2, Comprehension 1B)




6 = Probable concepts

Choosing the right answer requires identifying a "common" concept
that

a. is not stated in the selection
b. applies with a certain degree of appropriateness to the
   set of attributes or instances mentioned in the selection
c. and is the choice that best meets the above conditions.

ex: (read the selection under 5 above)

    The children went to a

    farm    zoo    park    circus

    (Gates-MacGinitie Reading Test,
    Primary C, Form 2, Comprehension 1A)

7 = Language concepts

Choosing the right answer depends upon semantic and/or syntactic
constructions such as: cliches, colloquialisms, antonyms,
relatives, antecedents, etc. which are not stated in the selection,
but are suggested by the general theme and/or contextual
implications of the selection.

ex: When Jane went shopping for a dress, she bought the
    least expensive one ______ her limited budget.

    1. in spite of        3. regardless of
    2. notwithstanding    4. on account of

    (Stanford Achievement Test,
    Intermediate II, Form W,
    Paragraph Meaning 16)



8 = Previous knowledge

Choosing the right answer requires previous knowledge, usually
obtained in a formal setting, of specific facts such as dates,
names, relationships, places, etc.

ex: From 1850 to 1880, Virginia City held a prominent
    place in the history of silver and gold mining. Its
    fabulous production of silver and gold has left a
    tremendous impression on all who ever heard of it.
    This production played an important role in financing
    the Union during the ______.

    1. War between the States
    2. Revolutionary War
    3. War of 1812
    4. Mexican War

    (Stanford Achievement Test,
    Intermediate II, Form Y,
    Paragraph Meaning 2)

9 = Word-picture matching

Choosing the right answer requires matching a word to its
corresponding picture. If the picture is not a clear
and simple representation of the word, others of the above
categories may be added. For example, if the picture
represents a probable concept, it would be rated "96."
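The combined rating described under category 9 can be illustrated with a short sketch. It is not part of the original study; the function name and the validation rules are assumptions made for illustration, based only on the statement that word-picture matching (9) may be combined with one of the preceding categories, as in "96."

```python
# Illustrative sketch only: the names below are hypothetical and were not
# used in the study. Category labels follow the Rating Scale for Questions.
QUESTION_CATEGORIES = {
    1: "recognition",
    2: "contextual paraphrase",
    3: "grammatical paraphrase",
    4: "semantic paraphrase",
    5: "definite concepts",
    6: "probable concepts",
    7: "language concepts",
    8: "previous knowledge",
    9: "word-picture matching",
}


def decode_question_rating(rating: str) -> list[str]:
    """Split a rating such as "96" into its category names.

    Assumption: a combined rating is written as 9 followed by one of the
    categories 1-8; any other multi-digit rating is rejected.
    """
    digits = [int(ch) for ch in rating]  # raises ValueError on non-digits
    if not digits or any(d not in QUESTION_CATEGORIES for d in digits):
        raise ValueError(f"unknown question rating: {rating!r}")
    if len(digits) > 2:
        raise ValueError(f"too many categories in rating: {rating!r}")
    if len(digits) == 2 and (digits[0] != 9 or digits[1] == 9):
        raise ValueError("only category 9 may be combined with one of 1-8")
    return [QUESTION_CATEGORIES[d] for d in digits]
```

Under these assumptions, `decode_question_rating("96")` yields word-picture matching together with probable concepts, which matches the example given in the scale.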
RATING SCALE FOR CHOICES

Definitions

Textual - the distractor is stated in the selection (possibly in
    a different number, tense, etc.). If there are a number
    of words, the distractor is rated textual when:

    a. some of the words are stated explicitly in the
       selection and some are paraphrased
    b. most of the content words are stated explicitly
       in the selection

Grammatical - the distractor fits the grammatical context of the
    question. Lexical constraints on this category include:
    a verb that can only have an animate subject or object;
    an adjective that can only modify animate nouns, etc.

Categorical - the distractor fits the same general category of
    descriptors, objects, events, etc. as the correct choice. This
    category is determined by the word meaning as well as
    its context in the question and selection. Where
    appropriate this refers to distractors that are coordinates,
    synonyms or antonyms of the correct choice.

Associative - the distractor has "associative value" to either the
    general theme of the selection or the meaning of the
    correct choice. This category is not as close to the
    meaning of the right choice as "categorical" above,
    yet it is not irrelevant. Where appropriate this
    refers to distractors that are superordinates, subordinates,
    functions or features of the correct answer.

Other - the distractor is irrelevant and thus unrelated to either
    the general theme of the selection or the meaning of the
    correct choice. It is not found in the reading selection,
    nor would it be a grammatical answer to the question.



Codesᵃ

 1 = Other
 2 = Grammatical
 3 = Associative
 4 = Associative-Grammatical
 5 = Categorical
 6 = Categorical-Grammatical
 7 = Textual
 8 = Textual-Grammatical
 9 = Textual-Associative
10 = Textual-Associative-Grammatical
11 = Textual-Categorical
12 = Textual-Categorical-Grammatical

ᵃWhen a category is not included in the code, it does not apply, e.g.
code "2" indicates that the distractor is a grammatical answer to the
question, but is not "associative," "categorical," or "textual." Code "7"
indicates that the distractor is a word in the text but is not
"grammatical," "categorical," or "associative."
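The twelve choice codes can be read as combinations of four yes/no judgments about a distractor. The following sketch shows one way to express that mapping. It is an illustration rather than a procedure used in the study; the function and variable names are hypothetical. The scale lists no code that combines "categorical" with "associative," so the sketch assumes that combination is invalid.

```python
# Illustrative sketch only: names are hypothetical. The table reproduces the
# twelve codes of the Rating Scale for Choices as sets of applicable features.
CODE_TO_FEATURES = {
    1: frozenset(),  # "other": none of the features apply
    2: frozenset({"grammatical"}),
    3: frozenset({"associative"}),
    4: frozenset({"associative", "grammatical"}),
    5: frozenset({"categorical"}),
    6: frozenset({"categorical", "grammatical"}),
    7: frozenset({"textual"}),
    8: frozenset({"textual", "grammatical"}),
    9: frozenset({"textual", "associative"}),
    10: frozenset({"textual", "associative", "grammatical"}),
    11: frozenset({"textual", "categorical"}),
    12: frozenset({"textual", "categorical", "grammatical"}),
}
FEATURES_TO_CODE = {features: code for code, features in CODE_TO_FEATURES.items()}


def choice_code(textual=False, categorical=False, associative=False,
                grammatical=False) -> int:
    """Return the scale code for a distractor judged on four features."""
    if categorical and associative:
        # Assumption: the scale has no code for this combination.
        raise ValueError("a distractor is rated categorical or associative, not both")
    flags = {
        "textual": textual,
        "categorical": categorical,
        "associative": associative,
        "grammatical": grammatical,
    }
    features = frozenset(name for name, applies in flags.items() if applies)
    return FEATURES_TO_CODE[features]
```

For instance, a distractor judged textual, associative and grammatical receives code 10, and one judged categorical and grammatical receives code 6.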
Examples

1

First Mother measured the milk, baking powder,
shortening, flour, and sugar. Then she mixed these
together with two beaten eggs. Finally she poured
the batter into a pan and put the pan into the oven.

A. Mother was making

   a cake    a dress (2)    cookies (6)ᵃ    flour (10)ᵃ

B. She did not use any

   milk (8)    salted (5)ᵃ    pepper    baking (9)ᵃ    sugar (12)

4

Ruth was busily getting her costume ready for
the party. She had already made a tall pointed hat
out of black paper. She and her mother had just
finished a long black cape. The broom that she would
ride was standing in the corner.

A. Ruth was going to the party as a

   witch    ghost (6)    cowgirl (4)    costume (11)ᵃ    pumpkin (4)

B. Ruth still needed a

   tall (7)ᵃ    fun (3)ᵃ    see (1)ᵃ    mask

(Gates-MacGinitie Reading Test,
Primary C, Form 1, Comprehension)

Note 1: Numbers in parentheses are example codes.

Note 2: Underlined words are the correct answers.

ᵃThese distractors were not in the original items. They were
included here for the purpose of demonstrating a particular scale category,
e.g. in the first paragraph, question A, the choice "cookies" was added
to demonstrate a "categorical-grammatical" distractor. "Cookies" are not
mentioned in the selection yet they are baked goods and essentially the
same type of object as a cake. Also, "cookies" completes the question
sentence in a grammatically acceptable manner. Another example is the
next choice, "flour." "Flour" is a textual-associative-grammatical
distractor. The word "flour" is stated in the text. Flour is an
ingredient of a cake and is thus "associative." "Flour" also completes the
question sentence in a grammatically acceptable manner.

Some of the original distractors in these items were omitted because
they duplicated ratings already demonstrated.
The raters, a Roxbury Latin School senior, a Radcliffe College senior
and a Harvard doctoral student, were trained as follows:

1. The rater studied the rating scales, which presented
short definitions for each category, code numbers for
each category, and usually an example rating for each
category.

2. The rater and author discussed the rating scales until
the rater stated that he understood them, e.g. the rater
asked questions about the scale and checked word
definitions.

3. The rater applied the scales to a few random items taken
from three test levels but from different forms of the
test batteries used in this study.

4. The rater was asked to justify each of his ratings.¹

5. The rater then applied the scales to the SAT Intermediate I.
He again had to justify each rating.

¹The rater was asked why he chose a code. He generally replied by
referring to part of the definition. Occasionally during the justifying
procedure, when raters looked back at the scales and the test items, they
spontaneously changed their rating. Where definitions on the scales were
not sufficiently clear, they were revised at this point.

The reading comprehension subtests analyzed in this study were
presented to each rater in a different random order. Different random
orders were used to avoid biases in ratings that may have resulted from
a standard sequence, i.e. coding all the lower level tests in a series,
or all the tests in one battery in a series, might have led raters to
use frequently appearing codes in a mechanical way. By
randomly distributing the tests the possibility of raters using the
same codes habitually was probably reduced.
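The randomized presentation described above can be sketched in a few lines. The sketch is illustrative and assumes nothing about how the orders were actually drawn in the study; the function name, rater labels and subtest labels are hypothetical placeholders.

```python
import math
import random

# Illustrative sketch only: the study does not report how its random orders
# were generated. All names and labels here are hypothetical.


def presentation_orders(subtests, raters, seed=0):
    """Give each rater a different random ordering of the same subtests."""
    subtests = list(subtests)
    if math.factorial(len(subtests)) < len(raters):
        raise ValueError("not enough distinct orders for every rater")
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    used = set()
    orders = {}
    for rater in raters:
        order = tuple(rng.sample(subtests, k=len(subtests)))
        while order in used:  # redraw until this rater's order is unique
            order = tuple(rng.sample(subtests, k=len(subtests)))
        used.add(order)
        orders[rater] = list(order)
    return orders


if __name__ == "__main__":
    labels = [f"subtest_{i}" for i in range(1, 11)]  # placeholder labels
    for rater, order in presentation_orders(labels, ["rater_A", "rater_B", "rater_C"]).items():
        print(rater, order)
```

Each rater sees every subtest exactly once, and no two raters share a sequence, so no single ordering of levels or batteries is common to all raters.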

When a subtest had been coded by each rater, the results were
compared. When differences occurred each rater gave the justification
for his code. The justifications were discussed.¹ Generally, a
consensus was quickly reached among the raters. The code that all
raters agreed on for a given selection, question or choice was the
one noted. In the case of 2 questions and 6 choices no consensus was
reached. One rater contended that his code was as justified as the
other.² In these 8 cases the code agreed on by two of the raters was
used for the analysis.

¹Sometimes a dissenting rater changed his rating spontaneously after
rechecking the category definitions. Other times a dictionary was used
to justify ratings, e.g. in choices, to establish whether a given
distractor was a synonym, antonym, or feature of the correct answer. Such
information determined whether the distractor was coded "associative" or
"categorical." Another means of reconciling differences was for each
rater to present his reasoning, and also to evaluate the reasoning of the
others, e.g. again in choices, the correct answer choice was "flowers,"
and one of the distractors was "things." One rater contended that
"things" was too general and was not really "associative." Another rater
reasoned that "things" could be used as a substitute for "flowers"
within the context of the reading selection without significantly changing
the meaning (see cloze-like item, SAT, Primary 1, 28). The third rater
stated that "things" was general but was still relevant and thus should
not be coded "other." All raters agreed to rate "things" as "associative."
One other approach to reconciling differences was compromise, e.g. one
rater coded a distractor as "other," another rater coded the same
distractor as "categorical." After trying most of the above approaches, if
the raters still could not unanimously agree on one of the given ratings,
they compromised at "associative."

²Ratings of choices presented the most problems. The greatest
disagreement among raters was in the "associative" and "categorical"
codes. On the lower level tests the definition criteria of subordinate,
superordinate, coordinate, etc. were applied and fewer disagreements
existed. However, on the higher level tests, when word meanings became
more abstract and unfamiliar, the judgments became more subjective and
the differences among raters more numerous.

Ratings of questions became difficult when two categories overlapped,
i.e. a given question seemed equally appropriate for two categories.
For example, a question sentence very closely approximated the
selection sentence in which the information was originally given.
However, the question sentence was not really identical to the
selection sentence in that a modifier was added or omitted or the selection
sentence was active while the question sentence was passive. These
differences had to be subjectively evaluated and thus one rater coded
the question "recognition" while the other coded it "contextual paraphrase."

Ratings of selections presented the fewest problems since they were
generally self-explanatory and mutually exclusive.

Treatment of Data

The frequency with which each selection, question and choice scale
category appeared on each of the 10 reading comprehension subtests
studied was tabulated. The frequency with which the scale categories
appeared at each of the 3 test levels and in each of the 3 test
batteries was also tabulated.

In addition a comparative study was made of the similarities of
items in the reading comprehension subtest of one test battery, and
items in other subtests, e.g. word knowledge, science, social studies,
in the same battery.
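The frequency tabulation described under Treatment of Data amounts to counting how often each scale code occurs in a subtest, level or battery and expressing the counts as percentages. The sketch below illustrates that arithmetic only. The function names are hypothetical, and the codes in the usage lines are invented for demonstration and are not data from the study.

```python
from collections import Counter

# Illustrative sketch only: function names are hypothetical, and the sample
# codes at the bottom are invented, not taken from the study's data.


def percentage_by_category(codes):
    """Return {code: percent of all ratings} for one list of scale codes."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings to tabulate")
    return {code: 100 * n / total for code, n in sorted(counts.items())}


def most_frequent_category(codes):
    """Return the code that occurs most often in the list."""
    if not codes:
        raise ValueError("no ratings to tabulate")
    return Counter(codes).most_common(1)[0][0]


if __name__ == "__main__":
    invented_selection_codes = [2, 2, 2, 1, 7]  # made-up ratings for five items
    print(percentage_by_category(invented_selection_codes))
    print(most_frequent_category(invented_selection_codes))
```

Applying the same two functions to the ratings pooled by test level, or by battery, would give the level and battery comparisons reported in the chapter.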
Results and Discussion

The task data are presented and discussed in the form of 
general conclusions about the four objectives of this study. Selected 
data tables are included in the text to follow; however, for the 
reader who is interested in more specific results, the percentage of 
each rating in tests and levels is presented in Appendix A, Tables 
32-38. 

The first objective of this study was to characterize the nature
of reading comprehension as tested at three grade levels.

In order to determine the tasks common to the CAT, GMRT and SAT,
the data of these test batteries were combined for the lowest level
tests, for the intermediate level tests, and for the advanced level
tests.

Figure 3 presents the composition of typical items in tests
intended for grades 1-2, in tests intended for grades 4-6 and in tests
intended for grades 9-14.¹ Tables 33 to 35 in Appendix A present the
percentage of reading selections, questions and choices in each scale
category at each of the three test levels. Seven generalizations
were made on the basis of the task data.

¹Typical as used here refers to the most frequently occurring
category.

Figure 3: Typical reading comprehension items at three test levels

1. Typical reading selections were different on the lowest,
intermediate and advanced test levels. On lowest level tests most of
the reading selections (71%) were stories. Stories were generally
about common objects, experiences or people. At the intermediate
level reading selections about science were most prevalent (49%).
Science selections included general science, biology, physics, etc.
Although science selections were also quite frequent in the advanced
level tests (31%), selections about humanities (23%) such as art
and theology, and about social science (20%), such as psychology,
were also numerous. This suggests the second generalization.

2. The range of selection topics became broader at higher test
levels. In tests intended for grades 1-2, most of the selections
(71%) were stories; the next highest category was riddles (24%).
There were also 4% science selections at the lowest test level.

Although science selections were most prevalent (49%) in the
intermediate test level, 19% of the reading selections were about
social studies and 16% of the selections were stories.

As can be seen in Table 33, Appendix A, the reading selections
at the advanced level were distributed among even more categories.

3. Reading selections were not only about more topics at each
higher test level; they were about more academic topics. Stories about
common people, experiences, and events consistently decreased at each
higher test level, i.e. 71% of the reading selections at the lowest
test level were stories; 16% of the reading selections at the
intermediate test level were stories; and only 5% at the advanced test level
were stories. Reading selections about more basic school subjects, such
as science and social studies, hardly appeared at the lowest test level,
were most prevalent at the intermediate level, and became fewer at the
advanced level. Reading selections about the more academic subjects,
such as social science and humanities, appeared more often in the
advanced level tests.

Not all "school subjects" were equally represented, however. Few
reading selections about math, literature and music appeared in
comparison to many reading selections about science and social studies.

Consequently, reading comprehension was tested on selected and 
more "advanced school subjects" at each higher test level. Reading 
selections resembling reading matter from "life outside of school" 
were extremely infrequent, especially at the higher test levels. Yet, 
it would seem that for the greater population, especially those not 
pursuing academic careers, evaluation on more "everyday" reading
matter would be considerably more important than evaluation on academic 
reading matter. "Everyday" reading matter includes the things a person 
should be able to read in order to function effectively in today's 
world, e.g. newspaper articles, advertisements, guarantees, warranties, 
proposed legislation, trade manuals, job applications, tax forms, 
instructions for using appliances or tools, directions for cooking or 
baking, food ingredients, and so on. 

The reading of selections represented only one part of the task 
required by reading comprehension subtests. Another part of the task 
was using presented information to answer questions correctly. 

4. Typical questions were different for the lowest, intermediate
and advanced test levels. In the tests intended for grades 1-2, pupils
were asked to identify the words for common objects or generally
familiar concepts suggested in the reading selection. Such questions
were called "probable concept" (for definitions and examples of
question categories see the Rating Scale for Questions, p. 86) and made
up 37% of all the questions in tests intended for grades 1-2. Many
(26%) of the other questions at the lowest test level asked pupils to
match words with corresponding pictures.

Typical test questions at the intermediate level (35%) asked
pupils to identify one of the words used in the reading selection as
the correct answer to the question. However, the context of the word
in the selection was different from the context of the word in the
question. Such questions were called "contextual paraphrase."¹

¹The differences between a word's context in the selection and the
same word's context in the question varied. Sometimes a logical
relationship was established for the two contexts, by the test author, in
the reading selection; at other times it was not. In cloze-like items
the word sometimes appeared in the reading selection before the blank
(which represented the question), and sometimes after. The effects of
such differences were not taken into account in this study, but may be
useful to investigate in future research since such differences may
influence test performance.

The typical questions in the advanced level tests were of four
types. Twenty-four percent of the questions were "semantic paraphrase."
Twenty-one percent of the questions at the advanced level were
"contextual paraphrase." Eighteen percent of the questions were "probable
concepts" and another 18% were "previous knowledge."

The progression of questions from one test level to the next higher
test level analyzed seemed to be of two sorts. First, lower level test
questions were generally about common or general knowledge. Higher
level tests contained progressively more questions requiring previous
knowledge of a more "academic" nature.

Second, lowest level tests represented a more limited use of words 
and concepts than higher level tests. For example, in "matching," a 
word and picture usually represented identical things. Also, in
"contextual paraphrase" the same word was used both in the reading
selection and in the answer choice. On the other hand, in higher 
level tests, "semantic paraphrase" used different words to say the 
same or similar things. And "previous knowledge" required the use of 
numerous words and concepts neither presented nor necessarily implied 
in the reading selection. 

5. The range of question tasks became broader at higher test
levels. The highest concentrations of question tasks were in "probable
concepts" (37%) and matching (26%) at the lowest test level. Although
many questions were concentrated in "contextual paraphrase" (35%) at the
intermediate test level, there were also many "probable concept" (15%)
and "previous knowledge" (15%) questions. At the most advanced test
level, there was an even broader distribution, i.e. 24% "semantic
paraphrase," 21% "contextual paraphrase," 18% "probable concept," and
18% "previous knowledge" questions.

Consequently, at the lower grade levels, pupils could achieve
adequately on subtests of reading comprehension if they could match
pictures to words and could identify simple words and concepts. At the
intermediate test level, pupils were being tested more on the flexibility
of their vocabulary, e.g. using the same words in different contexts.
At the most advanced test level pupils were tested more on the
breadth of vocabulary, e.g. saying the same thing in different ways.
Generally, however, it was not a question of the student having to
supply the correct answer to a question. Rather, the student had to
choose the right answer to the question from a number of choices which
related to the question in different ways.

6. Typical distractors were similar in the lowest, intermediate
and advanced test levels. At each test level the most frequent
distractors were words that were grammatical answers to the question
as well as somewhat related to the meaning of the correct answer,
i.e. words that described a function, attribute, etc. of the correct
answer. These distractors were called "associative-grammatical" and
were 33% of the distractors in the lowest level tests, 30% of the
distractors in the intermediate level tests and 33% of the distractors
in the advanced level tests. The second most frequent type of
distractor was one that fit the grammatical context of the question
but was otherwise unrelated to the correct answer. Such choices were
called "grammatical" and made up 20% of the distractors in tests
intended for grades 4-6 and 25% of the distractors in tests intended for
grades 9-14.

7. Despite the similarities among the typical distractors at the
three test levels, some differences did appear in the overall
distribution of distractors. The percentage of "grammatical" distractors
consistently increased from level to level, and so did the percentage
of "associative" distractors, e.g. "grammatical" distractors were 16%
at the lowest test level, 20% at the intermediate test level and
25% at the advanced test level.

The other difference appeared when all those distractors that were 
words used in the reading selection were combined, no matter what other 
type of relationship they had with the question or correct choice, i.e. 
adding the number of "textual," "textual-grammatical," "textual- 
associative," etc. distractors for each level. Lower level tests had 
more distractors that were words used in the reading selection than 
higher level tests. The percentages were 35%, 27% and 23% in lowest, 
intermediate and advanced level tests respectively. 

The second objective of this study was to characterize the nature
of reading comprehension as tested by different test batteries.

In order to determine the tasks characteristic of each test battery,
the lowest, intermediate and advanced test levels within each battery
were occasionally combined. Figure 4 presents typical items for the
CAT, GMRT and SAT. Tables 36-38 in Appendix A present the percent
of reading selections, questions and choices in each scale category
for the CAT, GMRT and SAT. The following 6 generalizations were made
on the basis of the task analysis.

1. Findings about test levels in the CAT, GMRT and SAT were
similar to the findings about test levels when batteries were combined:

a. typical reading selections were different in the lowest,
intermediate and advanced level tests.

b. the range of selection topics became broader at higher test
levels.

c. reading selections at each higher test level included more
"academic" topics.

d. typical questions were usually different in the lowest,
intermediate and advanced level tests.

e. the range of question tasks became broader at higher test
levels.

f. differences among distractors at the 3 test levels became
clearer when test batteries were analyzed separately. Only
the SAT had consistently similar distractors at the three
grade levels. The GMRT and the CAT had different
combinations of distractors in the 3 test levels analyzed.

Figure 4: Typical reading comprehension items in three test batteries

For example, typical reading selections were different for the
3 test levels of the CAT, GMRT and SAT, e.g. in the lowest level CAT,
"story" was the category of all the reading selections, but "story"
never appeared in the intermediate and advanced CAT. In the
intermediate level CAT, 67% of the selections were about science and the
other 33% were about social studies. In the advanced level CAT, 40% of
the reading selections were about social studies and 20% each were
about social science, science and humanities. In the GMRT lowest level,
88% of the reading selections were stories and the other 12% were about
science. The intermediate level GMRT had 43% "science" selections,
28% "social studies" selections and 19% "stories." At the highest
level the GMRT had 33% "science" selections, 33% "humanities"
selections and a few selections in a number of other subject areas.

Consequently, the CAT, GMRT and SAT had a different combination
of reading selections at each level. The CAT, GMRT and SAT also
differed in their combinations of reading selections, especially
at the intermediate and advanced test levels.

2. Typical reading selections were somewhat different for the
CAT, GMRT and SAT. For example, combining the test levels, the CAT
had three frequent kinds of reading selections: 33% "story," 25%
"social studies" and 25% "science." Both GMRT and SAT typically had
either "story" or "science" selections. The GMRT had 35% "story"
selections and 31% "science" selections, while the SAT had 30% "story"
and 30% "science" reading selections.

3. The CAT, GMRT and SAT differed in the number of selection
categories they included. The CAT had the fewest categories, i.e.
"story," "social studies," "social science," "science" and "humanities."
The GMRT had six categories, i.e. "story," "language," "social studies,"
"social science," "science" and "humanities." The SAT had the most
reading selection categories, i.e. "riddle," "story," "language,"
"math," "social studies," "social science," "science" and "humanities."

Despite the differences among test batteries in the topics of
reading selections at the intermediate and advanced test levels, the
reading selections all tended to be about school subjects. As noted
earlier, there were essentially no selections that resembled anything
other than "school type" reading material, e.g. newspaper articles,
advertisements, recipes, job applications, etc. Furthermore, the selection
topics resembled only a narrow range of school subjects, i.e.
science and social studies. Reading selections in literature,
math, art, etc. were few in number.

Note: The breadth of reading selections in the CAT may be deceiving.
The CAT had the fewest selections of any battery, e.g. the CAT had 12
selections in the entire battery compared to 58 selections for the GMRT
and 95 for the SAT. Hence, even a few selections in one topic became a
rather high percentage.

4. Typical questions were also somewhat different for each test
battery. Characteristically, the CAT asked either "contextual
paraphrase" (28%) or "semantic paraphrase" (21%) questions (see Rating
Scale for Questions, p. 86, for a definition and example of question
categories). The GMRT characteristically asked "previous knowledge"
(22%) and "probable concept" (21%) questions. On the SAT, 34% of
the questions were "contextual paraphrase" and 22% of the questions
were "probable concept."

Possibly the question structure, e.g. cloze-like blanks in the
reading selection or separate questions following the reading selection,
influenced the question task. The CAT, which always had separate
questions, primarily had "contextual paraphrase" or "semantic
paraphrase" tasks. However, the CAT had other tasks as well, i.e.
17% "probable concept," 12% "recognition," and 7% each of
"grammatical paraphrase," "definite concept," and "previous knowledge."

The GMRT, which on the intermediate and advanced test levels always
had cloze-like blanks, primarily had "previous knowledge" or "probable
concept" tasks. However, the GMRT also required other tasks, i.e.
17% "matching," 13% "language concept," 10% "contextual paraphrase,"
9% "semantic paraphrase" and 7% "grammatical paraphrase."

Finally, the SAT, which had both cloze-like and separate
questions, seemed to require tasks most characteristic of both the CAT,
e.g. "contextual paraphrase," and the GMRT, e.g. "probable concept."
The SAT also had 14% "previous knowledge," 14% "semantic paraphrase,"
9% "grammatical paraphrase," 6% "language concept," 1% "definite
concept" and less than 1% "recognition" tasks.

Thus, it appeared that the different types of questions, e.g. 
cloze-like blanks, separate questions, were used to create almost 
all of the defined tasks, e.g. "contextual paraphrase," "previous 
knowledge." However, certain types of tasks seemed most characteristic 
of certain types of questions, e.g. cloze-like blanks were
characterized by requiring the use of general or academic knowledge not
presented in the reading selection. Separate questions were
characterized by tasks requiring the use of words stated in the reading
selection in a different context, or the use of different words to 
restate ideas presented in the reading selection. 

5. Choices were also somewhat different in the CAT, GMRT and SAT.
CAT distractors were most broadly distributed, e.g. 27% were
"associative-grammatical," 24% were "grammatical" and 24% were
"categorical-grammatical" (see Rating Scale for Choices, p. 89,
for a definition and example of choice categories).

GMRT distractors were generally either "associative-grammatical"
(30%) or "grammatical" (27%). SAT distractors were generally
"associative-grammatical" (34%).

Most distractors in the CAT, GMRT and SAT were grammatical
answers to the questions posed. All CAT distractors were grammatical
answers to the question. However, 78 GMRT distractors and 28 SAT
distractors were not grammatical answers to the question. When
ungrammatical distractors were used to answer questions they formed
odd-sounding sentences (see Appendix B). Inappropriate distractors fell
into 4 categories:

a. simple grammatical error - the distractor did not agree
with the number or tense of words in the question. For
example: "The values of such reinforcement induces the
student..." (SAT, High School, Q. 19).

b. category error - the distractor represented the wrong part of
speech, e.g. the question called for a noun, but the distractor
was an adjective. For example: "To receive the money, he must
show proper own" (GMRT, Survey D, Q. 31).

c. feature error - the distractor represented a semantic anomaly,
e.g. the question called for an animate subject, but the
distractor was inanimate. For example: "Pete is a house" (SAT,
Primary 1, Q. 35).

d. reality error - awareness of "reality" made the distractor
seem inappropriate. For example: "The children were very
empty" (GMRT, Survey D, Q. 1).

Note: Distractors are underlined.

Many of the grammatical distractors also had "association value"
to the correct answer. Miller (1963) described the word-association
studies which demonstrated that consistencies existed in the types
of associations different people have to given words. Studies like
Woodrow and Lowell's (1916) tabulation of the relative frequencies
as well as categories of word associations for children and adults
suggest a possible means of investigating the relative difficulty of
a set of distractors. For example, distractor sets may be compared
by the sum of relative frequencies of associations, or by the
frequency of categories of associations, e.g. if the correct answer
were "table" and the distractors were "furniture" (superordinate),
"eat" (verb) and "able" (assonance), a relative difficulty score might
be obtained by adding the relative frequencies from the
Woodrow-Lowell list: 3.7 (table-furniture) + 10.2 (table-eat) + 0.43
(table-able) = 14.33. In this manner it might be possible to
systematize the combination of distractors rather than continuing the
present rather random and intuitive procedure. Furthermore, if
identification of differences and sources of difficulty of distractor
sets becomes possible, diagnosis of pupil errors that result from
particular combinations of distractors may also become possible. Such
diagnosis may help teachers provide pupils with more direct
instruction as well as more specific exercises.
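
The proposed relative difficulty score can be sketched in a few lines of Python. This is only an illustration of the summation described above: the three frequencies are the ones quoted in the text, while the dictionary layout and the function name `distractor_set_score` are assumptions introduced here, not part of the study.

```python
# Relative association frequencies quoted in the text from the
# Woodrow-Lowell list, keyed by (correct answer, distractor).
# The dictionary layout is an assumption made for this sketch.
ASSOCIATION_FREQUENCY = {
    ("table", "furniture"): 3.7,   # superordinate
    ("table", "eat"): 10.2,        # verb
    ("table", "able"): 0.43,       # assonance
}


def distractor_set_score(correct_answer, distractors,
                         frequencies=ASSOCIATION_FREQUENCY):
    """Sum the association frequencies between the correct answer
    and each distractor in the set (hypothetical helper)."""
    return sum(frequencies[(correct_answer, d)] for d in distractors)


score = distractor_set_score("table", ["furniture", "eat", "able"])
print(round(score, 2))  # 14.33, the total given in the text
```

Under this scheme two distractor sets for the same correct answer could be compared by their scores, although whether a higher sum actually corresponds to a harder item is a question the study leaves open.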

Other distractors represented the same kind of objects,
events, etc. as the correct answer. What relationship a particular
type of distractor had to test levels or item difficulties was not
clear from the results of analyses conducted here.



6. The choices were different in the lowest, intermediate and 
advanced test levels of the CAT , GMRT , and SAT . Both CAT and GMRT 
distractors seemed more related to the selection, question and correct 
choice at the lowest level than at the higher test levels. For
example, many of the lowest level CAT distractors were grammatical
answers to the question as well as "associative" to the correct
choice (35%), or were a combination of grammatical answers to the
question, the same kind of "object" as the right choice and also in the
reading selection (30%). Many of the intermediate level distractors
were "categorical-grammatical" (40%), and many of the advanced level 
distractors were simply "grammatical" (30%). 

SAT distractors showed an opposite trend. Distractors in the 
lowest and intermediate level tests were usually "associative-grammatical." 
Distractors on the highest level test were either
"textual-categorical-grammatical" or "associative-grammatical."

Thus it appeared that while the CAT and GMRT shared a similar
pattern of distractors, i.e. using more words from the reading
selection at the lowest test level than at either of the higher levels,
the SAT had an opposite trend, i.e. using more words from the reading
selection at the highest level than at either of the lower test levels.

The third objective of this study was to identify the factors that may
contribute to difficulty of tested comprehension.

Correlations among empirical difficulty scores (the criterion of
difficulty in the present study) and task ratings were not possible
since the task ratings were descriptive and not quantitative. However,
two major observations were made about sources of item difficulty
during the rating of test items.

1. Generally it appeared from the task ratings that
selections, questions or choices may be the sources of item difficulty.
Items that were passed by only a small percentage of the pupils in the 
try-out population contained one of the following: 

a. selections that had unclear or uncommon information 

b. questions that required knowledge of specific facts 
or ideas 

c. distractors that seemed to be appropriate answers to 
the question. 

For example, in the GMRT, Survey F, the meaning of the selection
empirically found most difficult (its questions were passed by an
average of about 20% of the try-out population) was unclear. The
selection was rated as "humanities" by the raters more by
process of elimination than by a conviction that it represented
philosophy.

The objects of science, like the direct objects of
the arts, are an order of relations which serve as
tools to __50__ immediate havings and beings.
Goods, objects with __51__ of fulfillment are
the natural fruition of the discovery and employment
of means when the connection of ends with a
sequential order is __52__.

50. effect   prevent   reduce   export   replace
51. enjoyment   thoughts   uses   ends   qualities
52. weakened   required   judged   determined   lost

Note: The correct answer is underlined.

Furthermore, what the questions were testing was also difficult
to evaluate. According to the Rating Scale for Questions, questions
50-52 were rated as "probable concepts." Again raters picked this
question category more by a process of elimination than by a clear
understanding of what was being asked. The distractors were
generally "grammatical," "associative" or "associative-grammatical"
except for the distractor "ends" which was also used in the reading
selection.

The SAT, Intermediate I, Question 50 demonstrates a difficulty
that seemed to be related more to the choices than to the selection or
question. The raters judged the selection as "science" with no
difficulty.



Cattle, sheep, goats, antelope, and deer are similar 
in many ways. They all have hooves and may have horns. 
Also, they all have a fourfold stomach. Their food is 
swallowed in haste and is then returned to the mouth a 
little at a time to be chewed methodically before it 
is transferred to the other sections of the stomach for 
gradual digestion. In this respect these ruminants, or 
cudchewers, are alike. One major difference is in the 
horns. Cattle have horns with cores composed of honey- 
combed bone. The horns of antelope are practically 
solid bone, whereas the antlers of deer are true bone. 
The deer shed their antlers every year in the way a 
deciduous tree sheds its leaves, a detail in which 
they are unique. 

50. The best title for this paragraph would be

a. The Ruminants
b. How Many Stomachs?
c. Horn Structure in Animals
d. Deer, Sheep, and Cattle

Note: The correct answer is underlined.

The question was rated as "contextual paraphrase" since the
word "ruminants" was used in the reading selection in another
context. In this test item the source of difficulty seemed to be
the choices, especially "d," i.e. distractor "d" was rated as
"textual-categorical-grammatical." In a sense, distractor "d" was
almost a definition or an illustration of the correct answer and could
easily have been substituted for the correct answer. Distractors "b"
and "c" also related to the correct answer in that they included
"attributes" of ruminants which were touched upon in the reading
selection. This test item was answered correctly by only 11% of the
pupils in the standardization population (Kelley et al., 1966, p. 48).

Note: The average difficulty score for questions 50-52 of the GMRT,
Survey F, was 19.8%. When a question has 5 answer choices, each choice
has a 20% probability of being picked by chance. Thus, it would appear
that in an item with a selection which was meaningless, a question
which was totally ambiguous, and distractors which were neutral, each
choice would be picked by 20% of the testees.
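
The chance-level reasoning used for these two items can be made explicit with a short calculation. The sketch below assumes, as the text does, that a pupil guessing blindly picks each choice with equal probability; the function names are hypothetical, and the pass rates are the ones reported above (19.8% on the five-choice GMRT items, 11% on the four-choice SAT item).

```python
def chance_rate(n_choices):
    """Probability of picking the correct choice by blind guessing,
    assuming every choice is equally likely (an assumption)."""
    return 1.0 / n_choices


def margin_over_chance(pass_rate, n_choices):
    """Observed pass rate minus the chance rate. A value near zero
    suggests guessing; a clearly negative value suggests that some
    distractor attracted pupils away from the correct answer."""
    return pass_rate - chance_rate(n_choices)


# GMRT, Survey F, questions 50-52: five choices, 19.8% average pass rate.
gmrt_margin = margin_over_chance(0.198, 5)

# SAT, Intermediate I, question 50: four choices (a-d), 11% pass rate.
sat_margin = margin_over_chance(0.11, 4)

print(f"GMRT margin over chance: {gmrt_margin:+.3f}")
print(f"SAT margin over chance:  {sat_margin:+.3f}")
```

On these figures the GMRT items sit almost exactly at chance, whereas the SAT item falls well below its 25% chance rate, which is consistent with the suggestion that distractor "d" drew pupils away from the correct title.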

2. Generally raters seemed to have greater difficulty in
identifying appropriate ratings for selections and questions of ambiguous
items which were passed by a smaller percentage of pupils. For
example, as noted above, in such items ratings usually were made by 
the process of elimination. 

As illustrated in the comparison of test levels, items in 
higher level tests seemed to become more difficult because they were 
based on reading selections about more academic or obscure subjects 
and required previous knowledge of specialized subject matter as well 
as broader vocabularies. 

Note: Although the words in choice "d" were not exactly in the same
order as in the reading selection, they still all appeared close
together and were thus also rated "textual."

Possibly the aspects of reading selections that bring about item
difficulty, e.g. clarity, generality, abstractness, could be quantified
and subsequently correlated with difficulty scores. For example, a
number of raters might be asked to rate reading selections by a
"semantic differential." Semantic differentials could measure
ideational, language and affective characteristics. A sample of
three semantic differentials is presented in Appendix C.

The fourth objective of this thesis was to characterize the
nature of tested reading comprehension.

1. Three major conclusions have already been presented about
the nature of tested reading comprehension:

a. Tests of comprehension intended for grades 1-2, 4-6 and 9-14
characteristically had different reading selections and
questions. Selections at the lowest test level were
usually stories about common experiences, people or events,
while selections at higher test levels were usually about
science, history or humanities. Questions on the lowest
level tests asked general information or required the
matching of words to corresponding pictures. Intermediate level
questions required the use of a limited number of words
in different contexts. The advanced level tests required
restating ideas, using words in different contexts as well
as knowing "concepts" especially in science, social studies
and the humanities.

b. The CAT, GMRT and SAT included most types of selections,
questions and choices identified by the rating scales, but
they differed characteristically in the distributions of
selection, question and choice ratings. The CAT, GMRT
and SAT generally had "story" selections at the lowest
test level and science selections at the higher levels.
However, the SAT had more selection types than the GMRT
and CAT. The CAT and GMRT had a large percentage of
selections about "humanities" at the highest level while
the SAT did not.

CAT questions were more of a "paraphrase" type, i.e.
using words presented in the selection in different
contexts, or restating ideas presented in the reading
selection. GMRT questions were more of a "conceptual"
type, i.e. using either general or specific
information not stated in the reading selection.

While words from the selections of the lowest level CAT
and GMRT were frequently distractors, words from the
selections of the intermediate and advanced CAT and GMRT
were seldom distractors. On the SAT, words from the
selection were more often distractors at the higher than at the
lower test levels.

c. Item difficulty seemed to be related to the lack of clarity
in the reading selections, the amount of uncommon or
academic information required by the questions, and the
similarity of meaning between the correct choice and the
distractors. A rough indication of item difficulty seemed
to be the difficulty raters had in categorizing test items.

These conclusions suggest that reading comprehension test items,
especially at higher test levels, could be testing "information" and
"skills" that related to other school subjects as well. In
characterizing reading comprehension, it seemed appropriate to establish
the unique qualities of reading comprehension test items. Toward
this end reading comprehension test items and items of other
disciplines, e.g. science, social studies, were compared. On the basis
of this comparison another conclusion was reached.

2. Reading comprehension test items closely resembled test items
for other school subjects such as science and social studies.

To illustrate the similarity between comprehension test items
and test items from other school subjects a total of 8 test items were
selected from the social studies, science, word meaning, paragraph
meaning, i.e. comprehension, and mathematics subtests of Stanford
Achievement Tests.

The reader is requested to read each of the following test items 
carefully, to establish the kind of "information" or "skill" needed 
to answer the questions and, consequently, to determine which school
subject, i.e. social studies, science, word meaning, paragraph meaning, 
or mathematics the following items test: 

Note: The paragraph meaning subtest of the Stanford Achievement
Test was the reading comprehension test chosen for this exercise for
the following reasons:

a. the paragraph meaning subtest (SAT) tended to contain qualities
of both the CAT and GMRT (see preceding readability and task
analyses of comprehension subtests).

b. publishers of the Stanford Achievement Tests generously provided
the subtests for science, social studies, word knowledge, etc.

c. intercorrelations of subtest scores were readily available in
test manuals.

1. "O beautiful for heroes proved
In liberating strife..."

These heroes were probably the heroes of ______.

a. 1914
b. 1861
c. 1776
d. 1898

2. From 1850 to 1880, Virginia City held a prominent place
in the history of silver and gold mining. Its fabulous
production of silver and gold has left a tremendous
impression on all who ever heard of it. This production
played an important role in financing the Union during
the ______.

a. War between the States
b. Revolutionary War
c. War of 1812
d. Mexican War

3. Costa Rica is south of the United States. Since Costa
Rica is in Central America, the United States is ______
of Central America.

a. north
b. south
c. part
d. in the middle

4. A boy has to walk directly west in going from his home
to school. To come home quickly, he should walk ______.

a. north
b. west
c. south
d. east

5. Ruth wasn't upset by the little old man. Although he was
strange, she was rather pleased by him. She thought he
was ______.

a. wicked
b. fearful
c. quaint
d. dirty

6. A person who attempts to change or improve conditions
is called a ______.

a. coward
b. conservationist
c. reformer
d. conservative

7. A country is measured and mapped by means of trigonometry,
the branch of mathematics dealing with the measurement
of triangles. When we know the length of one side of a
triangle and the size of the two angles at its ends, we
have the information that will give us the length of
the other __I__ of the triangle and the size of
the third __II__ of the triangle.

I.  a. side
    b. three sides
    c. four sides
    d. two sides

II. a. arc
    b. altitude
    c. base
    d. angle

8. Suppose that we knew the formula for the area of a
triangle. We could use it to find formulas for the
area of ______.

a. rectangles, squares, and parallelograms, but not
trapezoids

b. rectangles, squares, parallelograms, and trapezoids

c. rectangles and squares, but not parallelograms or
trapezoids

d. none of the above



Answers:

1. Social Studies subtest, item 29
Stanford Achievement Test, Intermediate 2, Form X

2. Paragraph Meaning subtest, item 2
Stanford Achievement Test, Intermediate 2, Form T

3. Paragraph Meaning subtest, item 12
Stanford Achievement Test, Intermediate 1, Form X

4. Science subtest, item 20
Stanford Achievement Test, Intermediate 1, Form X

5. Paragraph Meaning subtest, item 17
Stanford Achievement Test, Intermediate 2, Form X

6. Word Meaning subtest, item 41
Stanford Achievement Test, Intermediate 2, Form X

7. Paragraph Meaning subtest, items 47 and 48
Stanford Achievement Test, Intermediate 1, Form X

8. Mathematics subtest, item 49
Stanford Achievement Test, High School, Form X

The selections, questions and choices of "paragraph meaning"
(reading comprehension) test items were very similar to the selections,
questions and choices of test items from other school subjects such as
social studies, science, word meaning and mathematics.

The investigation of similarity among subtests was pursued
by a study of the relationship between reading comprehension test
scores and the scores of tests in the other disciplines.

3. Comprehension tests seemed to be measuring the same kind of
"abilities" as tests of other school subjects, especially word meaning,
science and social studies.

Table 9 presents correlation coefficients of Stanford Achievement 
Test paragraph meaning scores and scores of word knowledge, spelling, 
arithmetic, social studies, and science subtests. Correlation 
coefficients of paragraph meaning test scores and Otis I.Q. scores are 
also presented in Table 9. 

The paragraph meaning scores correlated very highly with word
knowledge (.72 to .83), science (.72 to .82) and social studies
(.75 to .81). According to Commins and Fagin (1954, p. 327-328), "When
a number of tests have high intercorrelations, we may assume that they
are measuring to a large extent the same kinds of abilities..."

The consistently high correlations of comprehension and word
knowledge test scores seemed to correspond with the earlier finding
that many comprehension questions required breadth and depth
of vocabulary, e.g. "matching," "contextual paraphrase," "semantic
paraphrase."
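
One common way to read correlations of this size is to square them, which gives the proportion of variance two sets of scores share. The sketch below applies that standard calculation to the ranges reported in this section; it is offered only as an aid to interpretation and is not an analysis carried out in the study. The dictionary and function names are assumptions.

```python
# Correlation ranges of paragraph meaning scores with other subtests,
# as reported in this section (low end, high end).
REPORTED_RANGES = {
    "word knowledge": (0.72, 0.83),
    "science": (0.72, 0.82),
    "social studies": (0.75, 0.81),
    "spelling and arithmetic": (0.60, 0.74),
}


def shared_variance(r):
    """Coefficient of determination: the proportion of variance in one
    set of scores that is statistically accounted for by the other."""
    return r * r


for subtest, (low, high) in REPORTED_RANGES.items():
    print(f"{subtest}: {shared_variance(low):.0%} to "
          f"{shared_variance(high):.0%} shared variance")
```

Read this way, even the lowest coefficient reported for word knowledge, science and social studies implies that roughly half of the variance in paragraph meaning scores is shared with the other subtest.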

P. E. Vernon's (1962, p. 269) observation that all subject matter
tended "to take the form of complex reading comprehension tests" seemed
to apply in the reverse as well. The considerable percentage of 
"previous knowledge" questions on tests of comprehension suggested that 
pupil performance on tests of comprehension depended, in part, on the 
pupil's knowledge of information not stated in the reading selection. 

The numerous reading selections about science and social studies in 
tests of comprehension suggested that knowledge of science and social 
studies was required. The generally higher correlations of reading 
comprehension with social studies and science than with spelling and 
arithmetic seemed to corroborate this conclusion. 

Although the correlations for paragraph meaning test scores with 
scores of spelling and arithmetic tests were somewhat lower (.60 to .74), 
they still showed a considerably close relationship between the tests. 
I.Q. scores had a relatively low correlation (.39) with paragraph
meaning scores at the lowest test level. However, the correlation of I.Q.
and paragraph meaning scores increased through the test levels and was
.82 at the advanced test level.

Note: "Breadth" refers to knowing the meaning of many different words and
"depth" refers to knowing the many meanings of a given word.

The Otis I.Q. test intended for lower elementary school grades
consisted entirely of picture items and oral instructions (Otis, 1954,
p. 1). The reading comprehension tests analyzed at corresponding
levels, i.e. grades 1-2, required reading of words, sentences and
paragraphs. Thus, the two types of tests did not appear to be testing
similar "abilities" and the relatively low correlations were to be
expected. However, at higher grade levels sections of many
"paper-and-pencil" I.Q. tests were essentially identical to reading
comprehension tests. Higher level I.Q. tests generally contained some
reading selections, questions about the selections, and multiple-choice
answers. Thus, the two types of tests appeared to be testing some
identical "abilities" and, therefore, the higher correlations were to
be expected.

In addition, the high correlation at higher grade levels between 
scores on reading comprehension and I.Q. tests may also have resulted 
from the interdependent validity of these tests. For example, some
reading test-authors assumed that "circumstances that contribute to
high or low I.Q. scores in a school population are also the main factors
contributing to high or low reading scores" (Gates and MacGinitie, 1970,
p. 1). Thus these test authors used I.Q. scores as an "external
validity criterion." Conversely, "many intelligence tests are validated
against measures of academic achievement..." (Anastasi, 1961, p. 190),
i.e. standardized achievement tests. The difference in correlations of

¹The California Test of Mental Maturity intended for lower elementary
grades also consisted entirely of picture items and oral instructions
(Sullivan, 1963, p. 6).
I.Q. and reading test scores at higher and lower levels may also be
attributed to Chall's (1967, p. 138-139) suggestion that intelligence
would be more of a factor in limiting performance on advanced "aspects
of reading comprehension, such as 'reading to predict outcomes,'
'making inferences,' and the like," than on less advanced aspects
such as "reading for details and following directions."

To view the relationship of reading comprehension test scores and 
scores of tests in other school subjects in proper perspective, a study 
of the relationship of scores from different reading comprehension 
tests was undertaken. 

4. Scores of different comprehension tests did not seem to be
more highly related to each other than to scores of tests in other
school subjects.

Table 10 presents correlation coefficients for scores of the 
California Achievement Test comprehension subtest with scores of the 
a) California Achievement Test vocabulary subtest, spelling subtest,
and arithmetical reasoning subtest, b) California Test of Mental
Maturity language and non-language I.Q.s, c) Metropolitan Achievement
Test reading, i.e. comprehension, subtest, d) Iowa Tests of Basic Skills 
comprehension subtest and vocabulary subtest, and e) Stanford Achievement 
Test paragraph meaning subtest. 

Correlations of California Achievement Test comprehension scores 
with test scores of other school subjects generally corresponded to those 
on the Stanford Achievement Test presented in Table 9: 
a. correlations were generally high

b. the test most highly correlated with reading comprehension
seemed to be word knowledge

c. I.Q. scores had a low correlation with reading comprehension
scores at the lowest test level, but a relatively high
correlation at higher test levels.

Table 10 indicates that the correlation at the lowest test level
of scores on California Achievement Test comprehension and on Stanford
Achievement Test paragraph meaning was .62. This correlation was lower
than the correlation of the California Achievement Test comprehension
subtest scores to both California Achievement Test vocabulary (.75) and
spelling (.67) subtest scores at that level.

At the intermediate test level the correlations among different 
reading comprehension subtests ranged from .78 to .83, while the 
correlations of reading comprehension subtests to subtests of other 
school subjects ranged from .50 to .79. 

The study of correlation coefficients did not indicate the
existence of major differences among reading comprehension test scores
and test scores of word knowledge, science, social studies, and intelligence.
A comparative analysis of items from various tests clarified this
phenomenon. All these tests appeared to require knowledge of word
meanings and uses, knowledge of general information, and knowledge
of information related to selected school subjects, e.g.
social studies, science. Consequently, scores of reading comprehension
tests generally did not appear to tell the teacher more about
pupils' "reading ability" than did scores of tests on intelligence
or other selected school subjects.



There are numerous other influences on test performance which 
do not relate to item content and are therefore outside the topic of 
this thesis. For example, test characteristics such as test instruc- 
tions and the conditions under which the test is administered influence 
performance (Klein, 1971, p. 3-4). Many pupil characteristics, such as
motivation and test-taking skill, also influence test performance
(Anastasi, 1961, 61-66; Cronbach, 1954, 181-187).



CHAPTER VI 



New Tests of Reading Comprehension 

Different tests of reading comprehension emphasize different
stylistic elements, as demonstrated by the readability analysis
(Chapter IV), and different tasks, as demonstrated by the task
analysis (Chapter V). Yet they all correlate highly with each other
and with tests in other subject areas. Most of these tests appear to
be measuring vocabulary, general intelligence, "reading" and previous
knowledge of school subjects to a lesser or greater degree.

Further study of the relationship between readability and tasks
in reading comprehension tests would undoubtedly be enlightening.¹

However, the information accumulated by the present analysis is
sufficient to suggest some requirements of new tests of reading
comprehension. The new tests would not only establish the rank of a pupil in
relation to pupils in the standardization or norming population of the
tests, but would provide teachers with more specific diagnostic
information. Such information could be used to establish a pupil's
performance level in relation to the "criterion" of expected performance
and consequently also point out specific weaknesses. The new tests would
include 4 major features:

¹To establish statistically whether differences exist in the
empirical difficulty of the numerous combinations of selections, questions 
and choices an analysis of variance approach seems most appropriate. To 
establish the relationship among the numerous combinations of selections, 
questions and choices while controlling for the number or ratio of difficult 
words, a covariance approach seems appropriate. Both these approaches may 
be combined into one analysis of covariance using empirical item difficulty 
scores as data, and using the combined number or ratio of difficult words 
in the selection, question and choices as the covariate. 
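
The combined analysis of covariance suggested in this footnote could be sketched as follows. This is only an illustration under stated assumptions, not a procedure carried out in the present study: the function name `ancova`, the data layout, and every number are hypothetical. The sketch computes a classical one-way analysis of covariance by hand, with empirical item difficulty as the data, the hard word ratio as the covariate, and item type as the grouping factor.

```python
def ancova(groups):
    """One-way analysis of covariance with a single covariate.

    groups maps a label (e.g. an item type) to a list of
    (covariate, score) pairs. Returns the pooled within-group slope,
    the covariate-adjusted group means, and the F ratio for the
    group effect.
    """
    all_pairs = [pair for pairs in groups.values() for pair in pairs]
    n, k = len(all_pairs), len(groups)
    grand_x = sum(x for x, _ in all_pairs) / n
    grand_y = sum(y for _, y in all_pairs) / n

    def sums(pairs, mean_x, mean_y):
        # Sums of squares and cross-products about the given means.
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        syy = sum((y - mean_y) ** 2 for _, y in pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        return sxx, syy, sxy

    sxx_t, syy_t, sxy_t = sums(all_pairs, grand_x, grand_y)

    sxx_w = syy_w = sxy_w = 0.0
    means = {}
    for label, pairs in groups.items():
        mean_x = sum(x for x, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        means[label] = (mean_x, mean_y)
        sxx, syy, sxy = sums(pairs, mean_x, mean_y)
        sxx_w += sxx
        syy_w += syy
        sxy_w += sxy

    slope = sxy_w / sxx_w
    adjusted = {
        label: mean_y - slope * (mean_x - grand_x)
        for label, (mean_x, mean_y) in means.items()
    }
    ss_total_adj = syy_t - sxy_t ** 2 / sxx_t
    ss_within_adj = syy_w - sxy_w ** 2 / sxx_w
    ss_between_adj = ss_total_adj - ss_within_adj
    f_ratio = (ss_between_adj / (k - 1)) / (ss_within_adj / (n - k - 1))
    return slope, adjusted, f_ratio


# Hypothetical items: (hard word ratio in percent, difficulty score).
items = {
    "paraphrase": [(1, 3), (2, 2), (3, 7)],
    "concept": [(3, 8), (4, 7), (5, 12)],
}
slope, adjusted, f_ratio = ancova(items)
print(slope, adjusted, f_ratio)
```

The adjusted means show what each item type's difficulty would be if all types had the same hard word ratio, which is the comparison the footnote calls for.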



1. A definition of minimum length, sentence length and hard word
ratio for reading selections, questions and choices at the numerous
grade or test levels.

Reading comprehension relates to long and short reading matter
as well as to reading matter with many or few hard words.¹ The
preceding analyses revealed that pupils at lower grade levels generally
were tested by shorter reading selections with fewer hard words than
pupils at higher grade levels. Yet, the most appropriate length or
hard word ratio for reading matter at a given test level was not
apparent. Establishing minimum "criteria" in this respect for the grade
or test levels would improve the understanding both of what reading



¹The relationship of sentence length and "sentence complexity"
has already been noted. In attempting to establish a minimum "criterion"
for sentence length or "complexity," analyses such as those by Carol
Chomsky (1969) of the age level at which pupils acquire understanding
of certain syntactic structures may prove most useful.

Furthermore, lists of "easy" words which would be understood by
selected age or grade groups are available. For example, Stone's
Revision of Dale's List of 769 Easy Words includes words which most 1st
graders are expected to know. The Dale List of 3000 Familiar Words
includes words which most 4th graders are expected to know.
Consolidation and expansion of these and similar lists could help establish
minimum "criteria" for a given grade or test level. However, in
determining minimum "vocabulary" particular care should be taken not to
discriminate against the segments of the population who may have a
considerable "non-academic" vocabulary, but may have a limited "academic"
vocabulary.

Edgar Dale and Gerhard Eichholz have been working on comprehensive
lists for selected grades. Their final results have not been published;
however, an interim report, Children's Knowledge of Words, Bureau of
Educational Research and Service, The Ohio State University (1954 to 1960),
was printed.
comprehension at a given level entails and of what difficulties given
pupils have in reading comprehension.¹
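
The three measures named in this feature (length, sentence length and hard word ratio) can be counted mechanically and compared with a stated minimum. The following is a rough sketch only: the word list, the threshold values, and the function names are hypothetical placeholders, and the sentence splitting is far cruder than the counting rules of the Dale-Chall or Spache formulae.

```python
import re


def selection_profile(text, easy_words):
    """Return word count, mean sentence length and hard word ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # A "hard" word is simply one that is missing from the easy-word list.
    hard = [w for w in words if w not in easy_words]
    return {
        "length": len(words),
        "sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "hard_word_ratio": len(hard) / len(words) if words else 0.0,
    }


def meets_minimum(profile, minimum):
    """True if every measured value reaches the stated minimum."""
    return all(profile[key] >= value for key, value in minimum.items())


# Stand-in for a real easy-word list such as those cited in the footnotes.
EASY = {"the", "cat", "sat", "dog", "ran"}
# Hypothetical minimum "criteria" for one test level.
MINIMUM = {"length": 5, "sentence_length": 3.0, "hard_word_ratio": 0.10}

profile = selection_profile("The cat sat. The dog ran fast.", EASY)
print(profile, meets_minimum(profile, MINIMUM))
```

Reporting the three values separately, rather than as one grade score, keeps it visible which requirement a given selection fails to meet.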

2. A definition of the subject of selections in reading comprehension
tests.

Reading comprehension is related to all school subjects and to
reading material not necessarily read in schools. But the vocabulary
and language structures used in "school" and "non-school" reading
matter are not necessarily identical. Understanding reading selections
about social studies, for instance, does not necessarily indicate
understanding of trade manuals or contemporary literature.

If the objective of the tests is to establish how well pupils
read "academic" subjects, then test selections about social
studies, science, and humanities, for example, are most appropriate.
However, if the objective of the test is to establish how well pupils
understand "non-academic" reading, excerpts from newspapers, magazines,
etc. would seem more appropriate. And if the objective of the test is
to establish how well pupils cope with vague or meaningless reading,
such reading selections would be appropriate.



¹Glaser and Cox (1968, p. 545) in contrasting currently used
achievement tests with "criterion-referenced" tests explained that the 
currently used tests "need provide little or no information about the 
degree of proficiency exhibited by the tested behaviors in terms of 
what the individual can do. They tell that one pupil is more or less 
proficient than another, but do not tell how proficient either of them 
is with respect to the subject-matter tasks involved." On the other 
hand, criterion-referenced tests assess "The degree to which an indi- 
vidual's achievement resembles desired performance at any specified 
level along the continuum of attainment...." 
It would be useful to determine the grade level at which
particular topics could most appropriately be introduced or dropped
in sequential testing.¹ For example, at the lowest grade level
selections are mainly "stories." It is unclear whether other topics
such as social studies or science might not also be introduced at the
lowest grade level.² As revealed in the preceding analysis, the
percentage of stories about common events, people or experiences at the
highest test level is low. Yet, "stories" are a popular and frequent
form of adult reading both in and out of school, and therefore may
appropriately be included in advanced level tests.

3. A definition of the tasks necessary for supplying correct answers
to questions.

The preceding task analysis has identified types of questions
found on current comprehension tests. Generally, reading
comprehension questions require either "paraphrase" or "concept" tasks.

¹Generally, reading matter in the 1st and 2nd grade is concentrated
in school readers which contain mostly "stories." However,
pupils in the 1st and 2nd grade are also taught some social studies
and science. They may even do some reading in school about more
"academic" topics. This leads to the question of curricular validity
of tests, which is the correspondence between test and curriculum
content (Kelley, et al., 1966, p. 23). Usually it is expected that the
test is designed according to the curriculum. However, Klein (1970,
p. 2) suggested that it is not uncommon for educators to modify a
curriculum to correspond with tests. Thus, it seems appropriate for
test authors as well as educators to study these questions.

²The lowest level SAT had approximately 4% "science" selections.
"Paraphrase tasks" require pupils to pick answers which are
restatements of information explicitly given in the reading selections.
"Restatement" is possible in a number of ways. For example, sometimes
the answer is a picture representing the word(s). Sometimes the word 
is grammatically changed, e.g. different tense or number. Other times, 
different words with the same meaning are used. An additional influ- 
ence on "paraphrase tasks" is the context in which the information is 
presented. Sometimes the context of information given in the reading 
selection is essentially the same as the context of the same information 
in the answer, but not always. 

Figure 5 presents the "paraphrase tasks" found on the analyzed
reading comprehension tests. To summarize briefly, the following 6
"paraphrase tasks" were identified:

a/b. matching/selecting - the information was stated in a word(s),
but the answer was a picture representing the same thing.¹

c. recognizing - the same word(s) was used in the reading 
selection and answer. The contexts of the selection and 
answer were also essentially the same. 

d. contextual paraphrase - the same word(s) was used in the 
reading selection and answer. However, the context of the 



¹Due to the small number of picture answers, all questions that
required picture-word matching were put into one category. However,
there were really two types of items. In one type, matching, pictures
represented the words exactly. In the other type, selecting, the
picture either added to or omitted from information described by the
words.
information in the answer was different from the context
of the same information in the reading selection.

e. grammatical paraphrase - the word(s) used in the selection 
was grammatically different, e.g. tense, number, from the 
"same" word(s) in the answer. The contexts were also 
different. 

f. semantic paraphrase - the word(s) used in the selection 
were different from the words used in the answer; but, 
they both meant the same thing. The contexts were also 
different. 

Two types of questions do not appear in the reading
comprehension tests analyzed:¹

a. grammatical change - the word(s) used in the selection is 
grammatically different from the "same" word(s) in the 
answer. The context is essentially the same. 

b. semantic change - the word(s) used in the selection is
different from the word(s) used in the answer, but they
both mean the same thing. The context is essentially the
same.
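
The paraphrase categories listed above, together with the two that were not found, can be read as a small grid: how the wording of the answer relates to the wording of the selection, crossed with whether the context stays essentially the same. A minimal sketch of that reading follows. The names `PARAPHRASE_TASKS` and `classify_paraphrase` and the labels "same", "grammatical" and "semantic" are introduced here for illustration only and are not part of the rating scales used in this study.

```python
# (wording relation, context essentially the same?) -> (task, found on analyzed tests?)
PARAPHRASE_TASKS = {
    ("same", True): ("recognizing", True),
    ("same", False): ("contextual paraphrase", True),
    ("grammatical", True): ("grammatical change", False),
    ("grammatical", False): ("grammatical paraphrase", True),
    ("semantic", True): ("semantic change", False),
    ("semantic", False): ("semantic paraphrase", True),
}


def classify_paraphrase(wording, same_context, picture_answer=False):
    """Return (task name, whether it appeared on the analyzed tests)."""
    if picture_answer:
        # Picture answers were pooled into one category (matching/selecting).
        return ("matching/selecting", True)
    return PARAPHRASE_TASKS[(wording, same_context)]


print(classify_paraphrase("grammatical", same_context=False))
print(classify_paraphrase("semantic", same_context=True))
```

Laid out this way, the two missing question types are simply the "same context" cells for grammatical and semantic rewording.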



¹The value of such items lies in the possibility that they may
facilitate the transition of learning to cope with progressively harder
reading comprehension questions. For example, it may be that matching
is the simplest question task, that selecting is a bit more difficult,
and that recognizing, contextual paraphrase, grammatical change,
grammatical paraphrase, and so forth become progressively more difficult.
"Concept tasks" require pupils to choose answers which represent
general or academic knowledge. The concepts are never explicitly
stated in the reading selection. However, the selection gives some
hints or cues. For example, sometimes generally known concepts are
cued by descriptions of their features. Other times generally known
concepts are cued by syntactic implications, e.g. colloquialisms,
idioms. On numerous occasions "academic" concepts are cued by their
features, or by related concepts. An additional influence on concept
tasks is the probability or certainty with which an answer is
identified. For example, sometimes only one answer fits the cues. Other
times one answer fits the cues only a little bit better than another.

Figure 6 presents the 4 "concept tasks" found on the three reading
comprehension tests analyzed:

a. definite concept - features of the concept which are given 
in the reading selection clearly identify only one answer. 

b. probable concept - features of the concept which are given 
in the reading selection imply that one answer is probably 
better than another. 

c. (probable) language concept - the language structures in
the reading selection suggest that one answer is probably
better than another. This category generally applies only
to questions in the form of cloze-like blanks.

d. (definite) previous knowledge - previous knowledge of 
"academic" facts clearly identifies one and only one answer. 
Two types of questions do not appear in the reading comprehension
tests analyzed:¹

a. (definite) language concept - the language structures in
the reading selection definitely imply only one answer.

b. (probable) previous knowledge - previous knowledge of
"academic" facts suggests one answer more than another but
neither definitely.

¹Again, the value of such items would lie in the possibility
that they could facilitate the transition of learning to cope with
progressively harder reading comprehension questions.

Generally, if the objective of the test is to establish how well
pupils manipulate explicitly stated information, then "paraphrase"
questions are appropriate. However, if the objective of the test is
to establish how well pupils manipulate "general concepts," then
"definite concept" or "probable concept" questions are more appropriate.
If the objective is to establish pupils' fluency in English, "language
concept" questions are more appropriate. And finally, if the objective
is to establish pupils' knowledge of academic facts, then "previous
knowledge" questions seem more appropriate. Whether questions testing
language fluency or previous knowledge belong on tests of reading
comprehension is not clear. Apparently achievement tests in English test
language fluency, and achievement tests in specific school subjects
test knowledge of facts. The inclusion of such items on tests of
reading comprehension has received the following criticism from Marks and
Noll (1967, p. 346):

Our intuitive notion of the comprehension 
task leads us to conclude that tests where 
scores are unduly influenced by specific 
previous knowledge or response biases are 
invalid measures of this ability. 

Similarly Guttman (1965) differentiated between "achievement" 
type items that would require previous knowledge of facts and "analytic 
ability" type items which would require the ability to analyze or 
manipulate given information. 

Finally, determining whether or not a given sequence of questions 
through the many test levels facilitates better performance may prove 
useful for both testing and teaching. 

4. A definition of the character of distractors in tests of reading
comprehension.

Distractors were initially introduced into the testing of reading
comprehension essentially to facilitate scoring and not to influence
item difficulty. However, they generally do affect item difficulty
and therefore may obscure rather than clarify the meaning of reading
comprehension test scores. Twelve types of distractors were identified
(see Rating Scale for Choices, p. 89). Distractor combinations were
often established during test construction by giving the questions to
a trial population in open-ended form. The most frequent errors made
by the trial population were later made distractors in the
multiple-choice form of that test (California Test Bureau, 1957, p. 6). However,
the nature of the most frequent errors was not analyzed and their effect
on item difficulty remained unknown. But, on the basis of the distractor
types identified in the preceding task analysis, it should be possible
to diagnose the types of errors pupils make consistently and to control
distractor difficulty.
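
The construction procedure described above, in which the most frequent open-ended errors of a trial population become the distractors, amounts to a frequency count. The sketch below illustrates only that counting step under simple assumptions (answers compared after trimming and lowercasing); the function name `frequent_errors` and the trial responses are invented for the example and do not come from any test manual.

```python
from collections import Counter


def frequent_errors(responses, correct, how_many=3):
    """Return the most frequent wrong open-ended answers with their counts."""
    key = correct.strip().lower()
    wrong = Counter(
        r.strip().lower() for r in responses if r.strip().lower() != key
    )
    return wrong.most_common(how_many)


# Hypothetical open-ended answers from a trial population.
trial = ["oak", "pine", "Pine", "bush", "oak", "pine",
         "flower", "bush", "pine", "oak"]
print(frequent_errors(trial, correct="oak"))
```

Counting alone reproduces the practice criticized here. The diagnostic step proposed in the text would additionally label each frequent error with its distractor type from the Rating Scale for Choices, so that the kind of error, and not only its frequency, is recorded.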

In conclusion, if "reading comprehension" is to be a meaningful
construct in teaching and testing, it seems to require a clear
definition. Otherwise instruction of "reading comprehension" is simply
a replication of instruction in science, history, or vocabulary. And
testing "reading comprehension" is simply a combination of testing
intelligence and numerous school subjects. Each test should focus on a
specific objective and reduce the influence of extraneous factors. For
example, tests in science could be simply worded, reducing the influence
of word knowledge. Tests in reading comprehension could provide all the
subject matter information needed, reducing the influence of previous
knowledge. Furthermore, if test-authors identify the particular
combination of "selections," "questions" and "choices" which they consider
"comprehension," the construct may develop defined features. For example,
one test-author might focus on "story" selections, "paraphrase" tasks and
"grammatical" distractors. Another test-author might prefer "academic"
selections, "concept" tasks and "textual" distractors, and so on.
Specifying objectives in this manner may help test-authors in construct- 
ing their tests. Descriptions of items may also permit teachers and 
administrators to decide more quickly and more knowledgeably if given 
tests are valid instruments for their purposes. 

Test-authors could also greatly facilitate the diagnosis and
possibly treatment of pupils who fail tests by specifying how item
difficulty is increased. For example, one test-author may increase
the proportion of difficult words. Another test-author may increase
the ambiguity of the question, and so on.

Finally, if literacy is a national priority and the attempt to
teach almost all citizens to read is continued, the "normal
distribution" model used in the design of current reading comprehension tests
is inappropriate. According to this model, prearranged proportions
of the population are designated as doing very well, sufficiently well
and "failing" on the test. Thus, a sizeable proportion of the national
population achieves below "grade level" by definition.

However, the use of the "criterion" model suggested above would
not condemn a considerable portion of the population to failure. By
defining the "criteria" of reading comprehension, this model would
facilitate not only a more meaningful evaluation of reading
comprehension but also the teaching of reading.
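
The contrast between the two models can be made concrete with a few invented scores. In the sketch below, "grade level" under the norm model is taken, as an assumption for illustration, to be the median of the tested group, and the criterion is an arbitrary fixed score of 65. Neither value comes from any of the tests analyzed.

```python
from statistics import median


def share_below_norm(scores):
    """Share of pupils scoring strictly below the group median."""
    cut = median(scores)
    return sum(s < cut for s in scores) / len(scores)


def share_meeting_criterion(scores, criterion):
    """Share of pupils at or above a fixed criterion score."""
    return sum(s >= criterion for s in scores) / len(scores)


before = [40, 50, 60, 70, 80]
after = [s + 20 for s in before]  # every pupil improves by 20 points

for label, scores in (("before", before), ("after", after)):
    print(label, share_below_norm(scores), share_meeting_criterion(scores, 65))
```

Under the norm model the share of pupils below "grade level" is the same before and after the improvement, because the reference point moves with the group. Under the criterion model the share of pupils meeting the fixed standard rises.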

Some items had no questions. There was a selection in the form of a description or direction which
indicated what the pupil was expected to do (see footnote "c" on Table 6 in Text, p. 57).

Mean, Standard Deviation, Minimum, Maximum and Range of the Number of
Non-Dale-Chall Words in the Questions by Test

Mean, Standard Deviation, Minimum, Maximum and Range of the Number of
Non-Dale-Chall Words in the Choices by Test

Invalid as grade scores. As noted above, the Spache formula was intended by its author only for grades 1 to 3. The Dale-Chall
formula was intended by its authors only for grades 4 and above. Consequently, the Spache grade scores for the intermediate and
advanced test levels and the Dale-Chall grade scores for the lowest test level only demonstrate the relationships existing among the test
levels and between the readability formulae.

N indicates that selection scores were weighted by the number of questions that went with a given
selection, e.g., if a selection had 4 questions the scores for the selection were included 4 times.
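
The weighting described in this note can be illustrated with a short sketch. The scores and the function name `weighted_mean` are hypothetical. Each selection score is repeated once per question that accompanies it, and an item consisting of a question without a selection is entered as 0, as the table notes indicate.

```python
def weighted_mean(selection_scores):
    """Mean of selection scores, each repeated once per accompanying question.

    selection_scores is a list of (score, number_of_questions) pairs.
    """
    expanded = [
        score
        for score, n_questions in selection_scores
        for _ in range(n_questions)
    ]
    return sum(expanded) / len(expanded)


# Hypothetical selection lengths in words; the last entry stands for a
# question that had no selection and is therefore entered as 0.
items = [(100, 4), (50, 1), (0, 1)]
print(weighted_mean(items))
```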

Some questions had no selections, just questions and choices, e.g., "Which is the big tree?" (GMRT, Level A,
item 2). The question was followed by 4 picture choices, one of which was a big tree. In weighting scores,
"0"s were included as selection data for such items.



Table 16 (contents not recoverable from the scanned original)



Table 18 






Mean, Standard Deviation, Minimum, Maximum and Range for the 
Spache and Dale-Chall Ratios in the Questions by Test 

Note 1: A hyphen means that one score was constant throughout all items of the test and therefore the correlation was 
meaningless. 



Note: The number of items in this test was 45. The critical correlation for 44 degrees of freedom at two significance levels 

Note 1: A hyphen means that one score was constant throughout all items of the test and therefore the correlation was 
meaningless. 

The number of items in this test was 64. The critical correlation for 63 degrees of freedom at two significance 

Note: The number of items in this test was 65. The critical correlation for 64 degrees of freedom at two significance 
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APPENDIX B 







COLLECTION OF ODD SOUNDING SENTENCES 



1. It is a something. (SAT-P1-1)*

2. It is a little. (SAT-P1-1)

3. It can go see. (SAT-P1-2)

4. It can go want. (SAT-P1-2)

5. It can go blue. (SAT-P1-2)

6. It is a pretty. (SAT-P1-5)

7. We are at here. (SAT-P1-6)

8. We are at fun. (SAT-P1-6)

9. His nose was big and sleepy. (SAT-P1-23)

10. Pete is a house. (SAT-P1-35)

11. If smallpox virus should enter the air of a vaccinated child,
the substance is there to prevent the virus from doing any
damage. (SAT-I1-19)

12. The name of the star Procyon means "before the dog," and it was
so named because it rises just in advance of Procyon Sirius.
(SAT-I1-24)

13. If, on the other hand, it stands together in a field or park, it
spreads out much more, and growth is not so restricted to
height. (SAT-I1-30)

14. In spite of the general increase in the cost of real estate, I
am sure the looks of his home has gone down. (SAT-I1-54)



Note: Distractors are underlined. 

*(test battery - test level - question number) 

SAT = Stanford Achievement Test, Form X, Paragraph Meaning Subtest 
P1 = Primary 1          I2 = Intermediate 2 
I1 = Intermediate 1     HS = High School 

GMRT = Gates-MacGinitie Reading Test, Form 1, Comprehension Subtest 
D = Survey D            F = Survey F 

15. The other parts of the spot can still see, and the part which
sees nothing leads to the impression that there is a black
spot floating in the air. (SAT-I2-13)

16. The other parts of the light can still see, and the part which
sees nothing leads to the impression that there is a black
spot floating in the air. (SAT-I2-13)

17. In Roman times Latin was unknown by the most important people
then living on the face of the earth. (SAT-I2-14)

18. The smaller the space to be occupied by the gas, the greater must
be the applied water. (SAT-I2-22)

19. The smaller the space to be occupied by the gas, the greater must
be the applied pump. (SAT-I2-22)

20. One should not confuse the number of light waves per second, or
the frequency of the air, with the rate at which light is
traveling. (SAT-I2-52)

21. The moon also travels around the earth in perihelion. (SAT-I2-56)

22. Good thought, like good reading, demands a sharp precision between
what is important and what is unimportant. (SAT-HS-1)

23. Good thought, like good reading, demands a sharp evaluation between
what is important and what is unimportant. (SAT-HS-1)

24. But when they are the reverse, one can always form an unfavorable
opinion of him, because his first mistakes are in making
these opinions. (SAT-HS-16)

25. Study in school is an activity that has as one of its chief natures
the mastery of school subjects. (SAT-HS-17)

26. This mastery is observed by grades, diplomas, vocational success,
status, and approval from others. (SAT-HS-18)

27. The values of such reinforcements induces the student to undertake
and carry out study activities. (SAT-HS-19)

28. This energy is produced, not by blowing apart the heavy elements
as in fission, but by focusing of light elements. (SAT-HS-43)

29. The children were very empty. (GMRT-D-1)

30. "There's a good strong wind bellow," said Dave. (GMRT-D-5)

31. "There's a good strong wind belong," said Dave. (GMRT-D-5)

32. "There's a good strong wind yesterday," said Dave. (GMRT-D-5)

33. As it is, they look so much like the surrounding snow that hunters
often do not see them until they melt. (GMRT-D-8)

34. As it is, they look so much like the surrounding snow that hunters
often do not see them until they aren't. (GMRT-D-8)

35. The porter who makes up the beds on a train has other wise too.
(GMRT-D-9)

36. For example, he helps the passengers with their comfortable as
they arrive at their destinations. (GMRT-D-10)

37. They do not own the foreshore, that strip of time lying between
the high-water and low-water marks. (GMRT-D-13)

38. They do not own the foreshore, that strip of land lying between
the high-water and low-water storms. (GMRT-D-14)

39. When flowers, it beats its wings so rapidly that they sound like
the hum of a tiny motor. (GMRT-D-15)

40. As one looks down a long, straight road, it seems to grow
narrower in the time. (GMRT-D-17)

41. As one looks down a long, straight road, it seems to grow
narrower in the turnpike. (GMRT-D-17)

42. Telephone poles give the distance of growing smaller as the eye
follows a row of them toward the horizon. (GMRT-D-18)

43. Telephone poles give the score of growing smaller as the eye
follows a row of them toward the horizon. (GMRT-D-18)

44. Telephone poles give the call of growing smaller as the eye follows
a row of them toward the horizon. (GMRT-D-18)

45. Telephone poles give the height of growing smaller as the eye
follows a row of them toward the horizon. (GMRT-D-18)

46. Prior to this it was thought idea for a man to run a "four-minute"
mile. (GMRT-D-19)






47. Then in 1961 Herb Elliott of Australia ran the mile in three
times, fifty-four and a half seconds. (GMRT-D-20)

48. He bettered Bannister's right by nearly five seconds. (GMRT-D-21)

49. He bettered Bannister's timely by nearly five seconds. (GMRT-D-21)

50. "Turnpike" is one name given to those highways where travelers
must pay told. (GMRT-D-22)

51. "Turnpike" is one name given to those highways where travelers
must pay roads. (GMRT-D-22)

52. All buildings using the turnpikes go through toll gates and thereby
share the cost of good roads. (GMRT-D-23)

53. All necessary using the turnpikes go through toll gates and thereby
share the cost of good roads. (GMRT-D-23)

54. All ready using the turnpikes go through toll gates and thereby
share the cost of good roads. (GMRT-D-23)

55. All without using the turnpikes go through toll gates and thereby
share the cost of good roads. (GMRT-D-23)

56. Jet planes now cover the Atlantic Ocean take only a fraction of
the time that Lindbergh took. (GMRT-D-25)

57. Jet planes now enter the Atlantic Ocean take only a fraction of
the time that Lindbergh took. (GMRT-D-25)

58. Jet planes now going the Atlantic Ocean take only a fraction of
the time that Lindbergh took. (GMRT-D-25)

59. Jet planes now crossing the Atlantic Ocean take only a double of
the time that Lindbergh took. (GMRT-D-26)

60. Jet planes now crossing the Atlantic Ocean take only a passing
of the time that Lindbergh took. (GMRT-D-26)

61. To receive the money, he must show proper face. (GMRT-D-31)

62. To receive the money, he must show proper own. (GMRT-D-31)

63. If the air ways increases to much more than sixteen pounds per
square inch, the whole world seems to be pressing down and
trying to suffocate you. (GMRT-D-32)

64. As they paddled in to the lakeshore, they saw the log cut which
was to be their headquarters for the trapping season. (GMRT-D-35)

65. "Couldn't be better scene," said Don. (GMRT-D-36)

66. "Couldn't be better tree," said Don. (GMRT-D-36)

67. "Couldn't be better season," said Don. (GMRT-D-36)

68. More time than he could have saved would now be locked trying to
get his bearings. (GMRT-D-40)

69. More time than he could have saved would now be sent trying to
get his bearings. (GMRT-D-40)

70. Championship diving is the importance of such specifics as
muscular control and coordination plus exact timing.
(GMRT-D-42)

71. Championship diving is the spring of such specifics as muscular
control and coordination plus exact timing. (GMRT-D-42)

72. Championship diving is the reading of such specifics as muscular
control and coordination plus exact timing. (GMRT-D-42)

73. Championship diving is the result of such specifics as muscular
springboard and coordination plus exact timing. (GMRT-D-43)

74. In 1959 the reverse side of the Lincoln cent was massed.
(GMRT-D-45)

75. The wheat heads were published by a front view of the Lincoln
Memorial, situated in Washington, D.C. (GMRT-D-46)

76. The wheat heads were registered by a front view of the Lincoln
Memorial, situated in Washington, D.C. (GMRT-D-46)



77. The wheat heads were reversed by a front view of the Lincoln
Memorial, situated in Washington, D.C. (GMRT-D-46)

78. A windshield made of steel glass is relatively safe because the
plastic layers have an elastic quality which prevents broken
glass from shattering and causing injuries. (GMRT-D-48)

79. A windshield made of laminated glass is relatively safe because
the plastic layers have an elastic quality which each broken
glass from shattering and causing injuries. (GMRT-D-49)

80. A windshield made of laminated glass is relatively safe because
the plastic layers have an elastic quality which tries broken
glass from shattering and causing injuries. (GMRT-D-49)

81. A windshield made of laminated glass is relatively safe because
the plastic layers have an elastic quality which encourages
broken glass from shattering and causing injuries. (GMRT-D-49)

82. A windshield made of laminated glass is relatively safe because
the plastic layers have an elastic quality which causes broken
glass from shattering and causing injuries. (GMRT-D-49)

83. Language changes through the return of new words and the dropping
of old ones. (GMRT-F-5)

84. These changes in language often plan changes in conditions within
the community. (GMRT-F-6)

85. These changes in language often forego changes in conditions
within the community. (GMRT-F-6)

86. Though a few minutes earlier I had felt that I could walk no
further, the sight of the sparse landmark, the solitary tree,
tonight silhouetted against the wintry sky, caused me to
quicken my pace. (GMRT-F-7)

87. Though a few minutes earlier I had felt that I could walk no
further, the sight of the familiar landmark, the solitary tree,
tonight grouped against the wintry sky, caused me to quicken
my pace. (GMRT-F-8)

88. By fixing an individual's place in society at birth, the caste
system prevented many talented people from desirable positions
where they could use their abilities for the benefit of the
nation. (GMRT-F-16)

89. By fixing an individual's place in society at birth, the caste
system prevented many talented people from successful positions
where they could use their abilities for the benefit of the
nation. (GMRT-F-16)

90. A foreign populated district in the North of Scotland is entitled
to its programs as much as an industrial area. (GMRT-F-17)

91. A Scottish populated district in the North of Scotland is entitled
to its programs as much as an industrial area. (GMRT-F-17)

92. A British populated district in the North of Scotland is entitled
to its programs as much as an industrial area. (GMRT-F-17)

93. Immediately I knew her whom he spoke. (GMRT-F-24)

94. Oxygen can be prepared in the laboratory by combining potassium
chlorate. (GMRT-F-29)

95. In such cases it is conceivable that the occurrence of large
droplets into the base of the clouds or of artificial freezing
bodies into the tops of the clouds might cause precipitation
or at least hasten its occurrence. (GMRT-F-32)

96. In such cases it is conceivable that the elimination of large
droplets into the base of the clouds or of artificial freezing
bodies into the tops of the clouds might cause precipitation
or at least hasten its occurrence. (GMRT-F-32)

97. In such cases it is conceivable that the cluster of large droplets
into the base of the clouds or of artificial freezing bodies
into the tops of the clouds might cause precipitation or at
least hasten its occurrence. (GMRT-F-32)

98. This was most likely to occur in large, economically complex
societies marked by unequal distribution of wealth and control
by an active poverty. (GMRT-F-39)

99. For a man to be on good terms with himself and his neighbors, he
must live in a society of equals where he depends not on the
caprice of a strong and wealthy minority but on sovereigns
applying to all members of the community establishing them.
(GMRT-F-40)

100. For a man to be on good terms with himself and his neighbors, he
must live in a society of equals where he depends not on the
caprice of a strong and wealthy minority but on nations applying
to all members of the community establishing them.
(GMRT-F-40)

101. Some of Darwin's conclusions were so odd to accepted beliefs that
they were condemned as absurd, contrary to common sense.
(GMRT-F-41)

102. Some of Darwin's conclusions were so contrary to accepted beliefs
that they were condemned as often, contrary to common sense.
(GMRT-F-42)

103. Some of Darwin's conclusions were so contrary to accepted beliefs
that they were condemned as completely, contrary to common
sense. (GMRT-F-42)

104. Goods, objects with enjoyment of fulfillment are the natural
fruition of the discovery and employment of means when
the connection of ends with a sequential order is determined.
(GMRT-F-51)

105. Goods, objects with thoughts of fulfillment are the natural
fruition of the discovery and employment of means when the
connection of ends with a sequential order is determined.
(GMRT-F-51)

106. Goods, objects with uses of fulfillment are the natural fruition
of the discovery and employment of means when the connection
of ends with a sequential order is determined. (GMRT-F-51)
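
Each sentence in this collection carries a source code in parentheses, which the legend describes as test battery, test level and question number. For readers who want to tabulate the sentences by battery or level, the following is a minimal sketch of how such a code could be split into its three parts. It is not part of the original study; the names `ItemCode` and `parse_item_code` are hypothetical, and only the abbreviations come from the legend in this appendix.

```python
import re
from typing import NamedTuple

# Abbreviations as given in the legend of this appendix.
BATTERIES = {
    "SAT": "Stanford Achievement Test",
    "GMRT": "Gates-MacGinitie Reading Test",
}
LEVELS = {
    "P1": "Primary 1",
    "I1": "Intermediate 1",
    "I2": "Intermediate 2",
    "HS": "High School",
    "D": "Survey D",
    "F": "Survey F",
}


class ItemCode(NamedTuple):
    """Hypothetical container for one source code such as SAT-P1-1."""
    battery: str
    level: str
    question: int


# Assumption: stray spaces around the hyphens are tolerated.
_CODE = re.compile(r"^\s*([A-Z]+)\s*-\s*([A-Z0-9]+)\s*-\s*(\d+)\s*$")


def parse_item_code(code: str) -> ItemCode:
    """Split 'battery-level-question' and check it against the legend."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"not an item code: {code!r}")
    battery, level, question = match.groups()
    if battery not in BATTERIES or level not in LEVELS:
        raise ValueError(f"unknown battery or level in {code!r}")
    return ItemCode(battery, level, int(question))
```

For example, `parse_item_code("GMRT-D-18")` yields the battery `GMRT`, the level `D` and the question number 18, so that the four distractor sentences built on that question can be grouped together.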
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SEMANTIC DIFFERENTIAL 
I 

Directions: Rate only the ideational character of the content, 

avoiding the influence of any other variables. 



familiar        1  2  3  4  5  6  7   unfamiliar
little          1  2  3  4  5  6  7   much
intellectual    1  2  3  4  5  6  7   unintellectual
simple          1  2  3  4  5  6  7   complex
interesting     1  2  3  4  5  6  7   boring
profound        1  2  3  4  5  6  7   superficial
easy            1  2  3  4  5  6  7   hard
subtle          1  2  3  4  5  6  7   obvious
earnest         1  2  3  4  5  6  7   flippant
abstract        1  2  3  4  5  6  7   concrete
clear           1  2  3  4  5  6  7   hazy
strong          1  2  3  4  5  6  7   weak
personal        1  2  3  4  5  6  7   impersonal
masculine       1  2  3  4  5  6  7   feminine
emotional       1  2  3  4  5  6  7   unemotional
pleasant        1  2  3  4  5  6  7   unpleasant
serious         1  2  3  4  5  6  7   humorous
good            1  2  3  4  5  6  7   bad
precise         1  2  3  4  5  6  7   vague
informative     1  2  3  4  5  6  7   uninformative
formal          1  2  3  4  5  6  7   informal
general         1  2  3  4  5  6  7   technical
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SEMANTIC DIFFERENTIAL 
II 



Directions: Rate only the language of the selection, avoiding the 
influence of any other variables. 

intellectual    1  2  3  4  5  6  7   unintellectual
easy            1  2  3  4  5  6  7   hard
subtle          1  2  3  4  5  6  7   obvious
succinct        1  2  3  4  5  6  7   wordy
earnest         1  2  3  4  5  6  7   flippant
clear           1  2  3  4  5  6  7   hazy
strong          1  2  3  4  5  6  7   weak
personal        1  2  3  4  5  6  7   impersonal
masculine       1  2  3  4  5  6  7   feminine
emotional       1  2  3  4  5  6  7   unemotional
pleasant        1  2  3  4  5  6  7   unpleasant
serious         1  2  3  4  5  6  7   humorous
florid          1  2  3  4  5  6  7   plain
good            1  2  3  4  5  6  7   bad
precise         1  2  3  4  5  6  7   vague
familiar        1  2  3  4  5  6  7   unfamiliar
little          1  2  3  4  5  6  7   much
simple          1  2  3  4  5  6  7   complex
interesting     1  2  3  4  5  6  7   boring
general         1  2  3  4  5  6  7   technical
formal          1  2  3  4  5  6  7   informal

SEMANTIC DIFFERENTIAL 
III 

Directions: Rate only the affective character of the content, avoiding 
the influence of such variables as ideas and language. 

thoughtful      1  2  3  4  5  6  7   thoughtless
simple          1  2  3  4  5  6  7   complex
profound        1  2  3  4  5  6  7   superficial
little          1  2  3  4  5  6  7   much
subtle          1  2  3  4  5  6  7   obvious
familiar        1  2  3  4  5  6  7   unfamiliar
clear           1  2  3  4  5  6  7   hazy
strong          1  2  3  4  5  6  7   weak
personal        1  2  3  4  5  6  7   impersonal
masculine       1  2  3  4  5  6  7   feminine
pleasant        1  2  3  4  5  6  7   unpleasant
serious         1  2  3  4  5  6  7   humorous
good            1  2  3  4  5  6  7   bad
precise         1  2  3  4  5  6  7   vague
affected        1  2  3  4  5  6  7   genuine
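
The three forms above share one layout: a list of bipolar adjective pairs, each rated on a seven-point scale. As an illustration only, the sketch below shows one way completed forms might be recorded and summarized. The form itself does not state how ratings were combined, so averaging across raters is an assumption here, and the names `check_form` and `mean_ratings` are hypothetical. Only three of the pairs from Semantic Differential III are listed, to keep the example short.

```python
from statistics import mean
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# The forms offer the points 1 through 7 for every adjective pair.
SCALE_POINTS = range(1, 8)

# Three of the pairs from Semantic Differential III.
PAIRS: List[Pair] = [
    ("thoughtful", "thoughtless"),
    ("simple", "complex"),
    ("affected", "genuine"),
]


def check_form(form: Dict[Pair, int]) -> None:
    """Reject a form with a missing or out-of-range rating."""
    for pair in PAIRS:
        if form.get(pair) not in SCALE_POINTS:
            raise ValueError(f"missing or out-of-range rating for {pair}")


def mean_ratings(forms: List[Dict[Pair, int]]) -> Dict[Pair, float]:
    """Assumed summary: the mean rating per adjective pair across raters."""
    for form in forms:
        check_form(form)
    return {pair: mean(form[pair] for form in forms) for pair in PAIRS}
```

A rating near 1 lies toward the left-hand adjective of a pair and a rating near 7 toward the right-hand one, so the direction of each pair has to be kept in mind when reading any summary.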
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